Vector processor

A vector processor, or array processor, is a CPU design that is able to run mathematical operations on a large number of data elements very quickly. This is in contrast to a scalar processor, which handles one element at a time; the vast majority of CPUs are scalar (or close to it). Vector processors were common in scientific computing, where they formed the basis of most supercomputers through the 1980s and into the 1990s, but general increases in performance and changes in processor design saw the near disappearance of the vector processor. IBM, Toshiba and Sony recently announced the Cell chip, which consists in part of several vector processors.

Today almost all commodity CPU designs include some vector processing instructions, typically known as SIMD.

Contents

1 History

2 Description

3 See also

4 External links

History

Vector processing was first worked on in the early 1960s at Westinghouse in their Solomon project. Solomon's goal was to dramatically increase math performance by using a large number of simple math co-processors (or ALUs) under the control of a single master CPU. The CPU fed a single common instruction to all of the ALUs, one per "cycle", but with a different data point for each one to work on. This allowed the Solomon machine to apply a single algorithm to a large data set, fed in the form of an array. In 1962 Westinghouse cancelled the project, but the effort was restarted at the University of Illinois as the ILLIAC IV. Their version of the design originally called for a 1 GFLOPS machine with 256 ALUs, but when it was finally delivered in 1972 it had only 64 ALUs and could reach only 150 MFLOPS. Nevertheless it showed that the basic concept was sound, and when used on data-intensive applications, such as computational fluid dynamics, the "failed" ILLIAC was the fastest machine in the world.

The first successful implementations of a vector processor appear to be the CDC STAR-100 and the Texas Instruments Advanced Scientific Computer (ASC). The ASC used a single ALU with four instruction pipelines, each able to run on a separate piece of data, which allowed vectors of 4 elements to be processed at a time. This was a fairly small number even then, but the ASC had better than normal memory throughput to make up for some of this. The STAR was otherwise slower than CDC's own supercomputers like the CDC 7600, but at data-related tasks it could keep up while being much smaller and less expensive. However, the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up.

The technique was first fully exploited in the famous Cray-1. Instead of leaving the data in memory like the STAR and ASC, the Cray design had eight "vector registers" which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. In addition the design had completely separate pipelines for different instructions; for example, addition was implemented in different hardware than multiplication. This allowed a batch of vector instructions themselves to be pipelined, a technique Cray called vector chaining. The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS, a respectable number even today.
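The register-to-register style can be illustrated with a small model. This is only a sketch: the 64-element register length comes from the description above, while the function name and the idea of walking a longer array in register-sized chunks are assumptions made for illustration, not a description of actual Cray code.

```python
VECTOR_LENGTH = 64  # words per vector register, as in the description above


def add_arrays_via_registers(mem_a, mem_b):
    """Add two equal-length arrays in register-sized chunks (hypothetical model)."""
    result = []
    for start in range(0, len(mem_a), VECTOR_LENGTH):
        # "Load" up to 64 elements from memory into two vector registers.
        v0 = mem_a[start:start + VECTOR_LENGTH]
        v1 = mem_b[start:start + VECTOR_LENGTH]
        # One register-to-register vector add; memory is not touched here.
        v2 = [x + y for x, y in zip(v0, v1)]
        # "Store" the result register back to memory.
        result.extend(v2)
    return result
```

In this model a 150-element array would be handled as chunks of 64, 64 and 22 elements, with memory touched only at the load and store steps.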

Other examples followed. CDC tried once again with its ETA-10 machine, but the machine sold poorly and the company took that as an opportunity to leave the supercomputing field entirely. Various Japanese companies (Fujitsu, Hitachi and NEC) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller. Oregon-based Floating Point Systems (FPS) built add-on array processors for minicomputers, later building their own minisupercomputers. However, Cray continued to be the performance leader, continually besting the competition with a series of machines that led to the Cray-2, Cray X-MP and Cray Y-MP. Since then the supercomputer market has focussed much more on massively parallel processing rather than better implementations of vector processors.

Today the average computer at home crunches as much data watching a short QuickTime video as did all of the supercomputers in the 1970s. Vector processor elements have since been added to almost all modern CPU designs, although they are typically referred to as SIMD. In these implementations the vector processor runs beside the main scalar CPU, and is fed data from programs that know it's there.

Description

In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, every CPU has an instruction that essentially says "add A to B and put the result in C".

The data for A, B and C could be - in theory at least - encoded directly into the instruction. However things are never that simple. In fact the data is rarely sent in raw form, and is almost always "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time, a time delay that has historically grown more annoying as CPU speeds have increased.

In order to reduce the amount of time this takes, most modern CPUs use a technique known as instruction pipelining, in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next gets the values, and the next does the math. With pipelining the "trick" is to start decoding the next instruction even before the first has left the CPU, in assembly line fashion, so the address decoder is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the latency, but the CPU can process the entire batch much faster than if it did so one at a time.
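The difference between latency and overall throughput can be made concrete with a small cycle-count model. The sketch below assumes an idealized three-stage pipeline (decode the address, get the values, do the math) in which every stage takes exactly one cycle and nothing ever stalls; the function names are illustrative only.

```python
def unpipelined_cycles(n_instructions, n_stages=3):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=3):
    """A new instruction enters the pipeline on every cycle (idealized model)."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles (the latency); each later
    # instruction finishes one cycle after the one before it.
    return n_stages + (n_instructions - 1)
```

Under these assumptions ten instructions take 30 cycles when run one at a time but 12 cycles when pipelined, even though the latency of each individual instruction is still three cycles.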

Vector processors take this concept one step further. Instead of pipelining just the instructions, they also pipeline the data itself. They are fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there". Instead of constantly having to decode instructions and then fetch the data needed to complete them, a vector processor reads a single instruction from memory and "knows" that the next address will be one larger than the last. This allows for significant savings in decoding time.

To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language you would write a "loop" that picked up each of the pairs of numbers in turn, and then added them. To the CPU, this would look something like this...

read the next instruction and decode it
get this number
get that number
add them
put the result here
read the next instruction and decode it
get this number
get that number
[and so on]

But to a vector processor, this task looks considerably different:

get the 10 numbers here and add them to the numbers there

Completing that single instruction may take longer than the simple add-two-numbers instruction in the general purpose CPU. However, this single instruction represents many instructions from the other CPU, so the vector processor not only skips all of those address decodes but also has only a single instruction to decode. Since the instructions are also stored in memory, and memory is typically very slow compared to the CPU, this technique dramatically improves overall performance by allowing the data set to be read from memory as fast as possible.
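The two listings above can be compared with a toy model that counts instruction decodes. This is a sketch under simplifying assumptions: each scalar add is treated as one decoded instruction per pair of numbers, loop-control overhead is ignored, and the function names are made up for illustration.

```python
def scalar_add(a, b):
    """Add pair by pair, decoding one add instruction per pair."""
    result = []
    decodes = 0
    for x, y in zip(a, b):
        decodes += 1          # read the next instruction and decode it
        result.append(x + y)  # get this number, get that number, add, store
    return result, decodes


def vector_add(a, b):
    """Model one vector instruction that covers the whole range."""
    decodes = 1  # a single instruction describes the entire operation
    result = [x + y for x, y in zip(a, b)]
    return result, decodes
```

For two groups of 10 numbers both functions return the same sums, but the scalar version counts 10 decodes where the vector version counts one.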

But more than that, the vector processor typically has some form of superscalar implementation, meaning there isn't one part of the CPU adding up those 10 numbers, but perhaps two or four of them. Since the output of a vector command does not rely on the input from any other, those two (for instance) parts can each add 5 of the numbers, thereby completing the whole operation in half the time.
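The effect of several adders working side by side can be sketched in the same way. The model below assumes every adder (called a "lane" here, a name chosen for this example) finishes one addition per step, and it counts how many steps the whole vector add needs.

```python
def lane_add(a, b, lanes=2):
    """Split one vector add across `lanes` independent adders (toy model)."""
    n = len(a)
    result = [None] * n
    steps = 0
    for start in range(0, n, lanes):
        # Within one step each lane handles one element; no element's
        # result depends on any other, so they can proceed together.
        for i in range(start, min(start + lanes, n)):
            result[i] = a[i] + b[i]
        steps += 1
    return result, steps
```

With 10 numbers and two lanes each lane performs 5 of the additions and the operation finishes in 5 steps instead of 10.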

Not all problems can be attacked with this sort of solution. Adding these sorts of instructions adds complexity to the core CPU, which typically suffers in more mundane parts of its performance, i.e., whenever it's not adding up 10 numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down the decoding of the more common instructions like if.

In fact vector instructions work best only when you have large amounts of data to work on. This is why these sorts of CPUs were found primarily in supercomputers, as the supercomputers themselves were found in places like weather prediction and physics labs, where huge amounts of data exactly like this are "crunched".

The NEC SX-6 supercomputer architecture is a NUMA architecture built out of SMP machines with 8 vector processors each.

External links

The History of the Development of Parallel Computing

Categories: Parallel computing

Last updated: 05-07-2005 13:47:13

Last updated: 05-13-2005 07:56:04
