Friday, November 29, 2013

ARMv8 64bit Architecture

From: http://www.quepublishing.com/articles/article.aspx?p=1843892&seqNum=5

The definition of a 64-bit architecture is a bit fuzzy. Typically either supporting 64-bit pointers or having 64-bit registers is considered a requirement. In AArch64 mode, ARMv8 provides both.

ARMv8 increases the size of the register set to 31 64-bit registers.

In most of the places where you find ARM chips, 64 bits isn't very useful. A mobile phone or a tablet, for example, doesn't usually run anything that would benefit from a 64-bit address space. The situation with ARM is quite different from x86, where the 64-bit transition brought the opportunity to clean up a lot of warts in the architecture. It's fairly common to see an x86 program run 10–20% faster when compiled for x86-64, because the 64-bit mode also brings with it advantages like these:
  • Program-counter relative addressing—important for position-independent code in libraries
  • More registers—reduce the need for register-to-stack copies
  • Guarantee of SSE—so the compiler doesn't have to emit x87 code for floating point
On architectures like MIPS, PowerPC, or SPARC, compiling in 64-bit mode can often make things slower, because the only significant difference is that you're using twice as much cache space for storing pointers. Unless you're using a lot of 64-bit integer arithmetic, there's little advantage.

However, some markets that interest ARM would benefit from a 64-bit architecture. One of the biggest growth areas at the moment is very low-power servers. ARM's current offering here is the Cortex-A15, which supports LPAE. This allows the operating system to use a 40-bit physical address space (up to 1TB), but only permits each application to use a 32-bit virtual address space. For things like databases, and even web servers that want to cache as much as possible in memory, this situation is less than ideal.

The address space, in common with most "64-bit" platforms, is actually not quite 64 bits. Pointers are 64 bits, but some bits are ignored. As with x86-64 (and SPARC64, and so on) there's a big hole in the middle of the address space, with the bottom (addresses up to 0x0000FFFFFFFFFFFF) reserved for userspace, and an equal-sized reservation at the top for the kernel.

There are several interesting things about how this design is implemented. One is that the top byte of the address is completely ignored when doing the virtual-to-physical mapping, which means those eight bits can be used to implement tagged pointers. There are some really interesting things you can do with tagged pointers, especially in object-oriented languages. The typical way of allocating an object is to use the first word to contain a pointer to its class (or vtable, in C++). For small objects, the class pointer can end up being a large fraction of the object's total space.
This is especially true for languages like JavaScript, where even numbers are objects. A naïve implementation would require 128 bits to store a 32-bit integer (96 bits, but rounded up to 128 by malloc). Using a tagged pointer lets you store the integer inside the pointer value, so you don't need any allocations.
With 8 bits, you can define 255 classes whose instances omit the class pointer (one tag value is reserved to mean "ordinary pointer"). This significantly reduces the total memory usage; more importantly, it can significantly reduce the cache usage.

As you might expect from a modern architecture, ARMv8 is designed to support virtualization. Under a hypervisor, every guest-virtual address must be translated first to a guest-physical address and then to a host-physical one, so page-table walks become considerably more expensive. ARMv8 does a few things to make this process cheaper; one is to allow 64KB pages, which shortens the table walk.
Larger pages also let a malloc() implementation get a reasonable amount of memory from the kernel cheaply without increasing internal fragmentation too badly, although the feature is probably more likely to be used for the hypervisor's page tables.

About Code Size and Instruction Length
32-bit ARM programs typically use one of two encodings: ARM or Thumb-2. ARM is a fixed-length 32-bit encoding. Thumb-2 is a variable-length encoding, with the most common instructions encoded in 16 bits.
Since AArch64 uses only the fixed-length A64 encoding, 64-bit code is likely to be larger. Worse, the 64-bit architecture removes two of the instruction-set features that were most useful for compressing code:
  • Predicated instructions. Most ARM instructions include a condition field, and they'll execute only when that condition is met. This means that ARM code needs fewer branch instructions, and for a long time ARM chips could achieve good performance without a branch predictor. Modern ARM chips include branch prediction, so this feature is somewhat less useful, and its cost in terms of complexity (which turns into power consumption) was deemed too high to justify keeping it.
  • Load and store multiple instructions. These instructions allow loading and storing an arbitrary subset of the register set—great for compilers and assembly programmers. A function prologue and epilogue each need only a single instruction to store, and later reload, any callee-save registers that the function modifies. Similarly, a call instruction just needs to be bracketed by a single instruction on each side to preserve any of the caller-save registers it cares about.
Although these instructions are useful (and great for producing dense code), they're fantastically complex to implement. With the enlarged register set, there should be less of a requirement to save and load registers, so hopefully these instructions aren't needed as much.

About Floating Point Computation
ARMv8 improves floating-point support dramatically. The floating-point register set is now 32 registers, each of which is 128 bits wide, allowing each to store four single-precision or two double-precision values. The architecture now fully supports the IEEE 754 standard for floating point, including all of the strange rounding modes and not-a-number values (for example, the result of dividing zero by zero) that the specification requires.

Cryptographic Support
A modern server has to do a lot of encryption and decryption. Most network connections want to be encrypted, and it's increasingly common to encrypt the contents of the disk as a precaution against theft. Being able to implement common encryption algorithms efficiently is important, but for the best performance and power usage it's even better to have them implemented in hardware.
Recent AMD and Intel chips provide custom instructions for implementing AES encryption. This design reduces the CPU cost of AES encryption and decryption to around a tenth of its cost in a pure software implementation.
ARMv8 goes one step further, providing not only AES instructions but also SHA-1 and SHA-256 instructions.

Mixing 32-bit and 64-bit Code
There is no mechanism for mixing 32-bit and 64-bit code. You can run both on the same chip, but they must be in separate processes. There is also no Thumb-3 giving a shorter encoding for common 64-bit operations, although I wouldn't be surprised if one appears in a future revision once ARM has had more time to work out exactly which subset of the AArch64 instruction set compilers like to generate.

The Future
The built-in hypervisor support will be very popular for mobile phones, making it easy to implement a small real-time hypervisor and a lightweight (probably 32-bit) OS to control the radio and a heavier general-purpose OS for the user interface.


Still to figure out:
  • Tagged Pointer
  • 64KB Page
  • Predicated Instruction
