Online Encyclopedia


Computer numbering formats

One of the common misunderstandings among computer users is a certain faith in the infallibility of numerical computations. That is, if you multiply, say:

$3 \times \frac{1}{3}$

you might expect to get a result of exactly 1. In practice, the result may prove to be something such as 0.9999999999999999 (as one might find when doing the calculation on paper) or, in certain cases, perhaps 0.99999999923475.

The latter result seems to indicate a bug in the system, and it can be a shock to learn that this is simply how a binary floating-point representation works. Decimal floating-point, computer algebra systems, and certain bignum systems might give either the answer of 1 or 0.999...

Contents

1 Bits, bytes, nybbles, and unsigned integers

1.1 Why binary?

2 Octal and hex number encoding

2.1 Converting between bases

3 Representing signed integers in binary

3.1 Sign and magnitude
3.2 One's complement
3.3 Two's complement

4 Representing fractions in binary

4.1 Fixed-point numbers
4.2 Floating-point numbers

5 Numbers in programming languages

6 Text are numbers: ASCII and strings

7 See also

Bits, bytes, nybbles, and unsigned integers

Almost all computer users understand the concept of a bit (that is, a 1 or 0 value encoded by the setting of a switch of some kind). A single bit can represent two states:

0 1

Therefore, if you take two bits, you can use them to represent four unique states:

   00 01 10 11

And, if you have three bits, then you can use them to represent eight unique states:

   000 001 010 011 100 101 110 111

With every bit you add, you double the number of states you can represent. Therefore, the expression for the number of states with n bits is 2ⁿ. Most computers operate on information in groups of 8 bits, or some other power of two, like 16, 32, or 64 bits, at a time. A group of 8 bits is widely used as a fundamental unit, and has been given the name of byte. A computer's processor and memory systems typically accept data as a byte or multiples of a byte at a time.

(In some cases 4 bits is a convenient number of bits to deal with, and this collection of bits is called, somewhat painfully, the nybble. However, the term byte is the more common.)

A nybble can encode 16 different values, such as the numbers 0 to 15. Any arbitrary sequence of bits could be used in principle, but in practice the most common scheme is:

   0000 = decimal 0           1000 = decimal 8
   0001 = decimal 1           1001 = decimal 9
   0010 = decimal 2           1010 = decimal 10
   0011 = decimal 3           1011 = decimal 11
   0100 = decimal 4           1100 = decimal 12
   0101 = decimal 5           1101 = decimal 13
   0110 = decimal 6           1110 = decimal 14
   0111 = decimal 7           1111 = decimal 15

This (rather than Gray code) is used because it mimics humans' more usual decimal counting system. For example, given the decimal number:

7531

we are taught to interpret this as:

(7 × 1000) + (5 × 100) + (3 × 10) + (1 × 1)

or, using powers-of-10 notation:

(7 × 10³) + (5 × 10²) + (3 × 10¹) + (1 × 10⁰)

(Note that any non-zero number to the power zero is 1.)

Each digit in the number represents a value from 0 to 9, which is ten different possible values, and that's why it's called a decimal or base-10 number. Each digit also has a weight of a power of ten proportional to its position.

Similarly, in the binary number encoding scheme explained above, the value 13 is encoded as:

1101

Each bit can only have a value of 1 or 0, which is two values, making this a binary, or base-2 number. Accordingly, the positional weighting is as follows:

  1101
  = (1 × 2³) + (1 × 2²) + (0 × 2¹) + (1 × 2⁰)
  = (1 × 8) + (1 × 4) + (0 × 2) + (1 × 1)
  = 13 decimal
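The positional weighting just shown can be sketched in a few lines of Python. This is only an illustration; the function name `binary_to_int` is made up for this example, and Python's built-in `int("1101", 2)` does the same job.

```python
def binary_to_int(bits: str) -> int:
    """Sum each bit multiplied by its positional weight (a power of 2)."""
    total = 0
    # The rightmost bit has weight 2**0, the next one 2**1, and so on.
    for position, bit in enumerate(reversed(bits)):
        total += int(bit) * 2 ** position
    return total


print(binary_to_int("1101"))  # 13
```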

Notice the values of powers of 2 used here: 1, 2, 4, 8. Older computer programmers generally got to know the powers of 2 up to the 16th power because they used them often:

   2⁰ =   1        2⁸  =    256
   2¹ =   2        2⁹  =    512
   2² =   4        2¹⁰ =  1,024
   2³ =   8        2¹¹ =  2,048
   2⁴ =  16        2¹² =  4,096
   2⁵ =  32        2¹³ =  8,192
   2⁶ =  64        2¹⁴ = 16,384
   2⁷ = 128        2¹⁵ = 32,768
                   2¹⁶ = 65,536

Traditionally, in this context, unlike the International System of Units, the value 2¹⁰ = 1,024 is referred to as Kilo, or simply K, so any higher powers of 2 are often conveniently referred to as multiples of that value:

   2¹¹ =  2 K =  2,048    2¹⁴ = 16 K = 16,384
   2¹² =  4 K =  4,096    2¹⁵ = 32 K = 32,768
   2¹³ =  8 K =  8,192    2¹⁶ = 64 K = 65,536

Similarly, the value 2²⁰ = 1,024 × 1,024 = 1,048,576 is referred to as a Meg, or simply M:

   2²¹ = 2 M
   2²² = 4 M

and the value 2³⁰ is referred to as a Gig, or simply G.

However, in December 1998 the International Electrotechnical Commission produced new units for these power-of-two values, in order to bring prefixes such as kilo- and mega- back to their SI definitions. (See Binary prefix.)

(There is another subtlety in this discussion. If we use 16 bits, we can have 65,536 different values, but the values are from 0 to 65,535. Humans start counting at one, machines start counting from zero, since it's easier to program them this way. This detail often confuses.)

The binary scheme just outlined defines a simple way to count with bits, but it has a few restrictions:

You can only perform arithmetic within the bounds of the number of bits that you have. That is, if you are working with 16 bits at a time, you can't perform arithmetic that gives a result of 65,536 or more.
There's no way to represent fractions with this scheme. You can only work with non-fractional (integer) quantities.
There's no way to represent negative numbers with this scheme. All the numbers are zero or positive (unsigned).

Despite these limitations, such unsigned integer numbers are very useful in computers for counting things one-by-one. They're very simple for the computer to manipulate.
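The first restriction can be illustrated with a small Python sketch. Python's own integers are not limited to 16 bits, so the example imitates a 16-bit register by masking off the higher bits. The names `MASK` and `add_u16` are invented for this illustration, and the wrap-around shown is an assumption about how the hardware treats overflow, which is typical but not universal.

```python
BITS = 16
MASK = (1 << BITS) - 1  # 65,535, the largest 16-bit unsigned value


def add_u16(a: int, b: int) -> int:
    """Add two values the way a 16-bit unsigned register would,
    discarding any bits that do not fit."""
    return (a + b) & MASK


print(MASK)                     # 65535
print(add_u16(65535, 1))        # 0: the result 65,536 does not fit
print(add_u16(40000, 30000))    # 4464 rather than 70000
```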

Why binary?

The logic that computers use (Boolean logic) is a two-valued logic, and thus the two states of a binary system can relate directly to two states in a logical system.
It is much easier to make hardware which can distinguish between two values than multiple values. Imagine a light switch as compared to a clock.
Binary is slightly more efficient than decimal. The optimal base to maximise efficiency is e (2.71828...), but that would not be very practical. A few experimental computers have been built with ternary (base 3) representation, as it is also quite efficient.

Octal and hex number encoding

Converting between bases

Each of these number systems is a positional system, but while decimal weights are powers of 10, the octal weights are powers of 8 and the hex weights are powers of 16. To convert from hex or octal to decimal, multiply the value of each digit by the weight of its position and then add the results. For example:

 octal 756
 = (7 × 8²) + (5 × 8¹) + (6 × 8⁰) 
 = (7 × 64) + (5 × 8) + (6 × 1)  
 = 448   +  40    +  6   = decimal 494

 hex 3b2
 = (3 × 16²) + (11 × 16¹) + (2 × 16⁰)
 = (3 × 256) + (11 × 16) +  (2 × 1)
 = 768  + 176  +  2 = decimal 946

Because 8 = 2³, an octal digit has a perfect correspondence to a 3-bit binary number:

   000 = octal 0
   001 = octal 1
   010 = octal 2
   011 = octal 3
   100 = octal 4
   101 = octal 5
   110 = octal 6
   111 = octal 7

Similarly, a hex digit has a perfect correspondence to a 4-bit binary number:

   0000 = hex 0       1000 = hex 8
   0001 = hex 1       1001 = hex 9
   0010 = hex 2       1010 = hex a
   0011 = hex 3       1011 = hex b
   0100 = hex 4       1100 = hex c
   0101 = hex 5       1101 = hex d
   0110 = hex 6       1110 = hex e
   0111 = hex 7       1111 = hex f

So it is easy to convert a long binary number, such as 1001001101010001, to octal:

   001 001 001 101 010 001 binary =
     1   1   1   5   2   1          111521 octal

and easier to convert that number to hex:

   1001 0011 0101 0001 binary =
      9    3    5    1          9351 hexadecimal

but it is harder to convert it to decimal (37713).
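The grouping trick for octal and hex can be sketched as follows. The helper name `regroup` is hypothetical; it pads the binary string on the left with zeros and then translates each group of 3 or 4 bits into one digit.

```python
def regroup(bits: str, group: int) -> str:
    """Convert a binary string to octal (group=3) or hex (group=4),
    one group of bits at a time."""
    # Round the length up to a multiple of the group size, padding with zeros.
    width = -(-len(bits) // group) * group
    padded = bits.zfill(width)
    digits = "0123456789abcdef"
    chunks = [padded[i:i + group] for i in range(0, width, group)]
    return "".join(digits[int(chunk, 2)] for chunk in chunks)


number = "1001001101010001"
print(regroup(number, 3))  # 111521
print(regroup(number, 4))  # 9351
print(int(number, 2))      # 37713
```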

Conversion of numbers from hex or octal to decimal can also be done by using the following pattern.

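(...((d1 * base + d2) * base + d3) * base + ...) * base + dn

That is, the first digit of the number is multiplied by the number's base and added to the second digit. For numbers with three or more digits, the running total is again multiplied by the base and added to the next digit, and so on until the last digit has been added.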

Examples of this are shown below.

hex A1

d1=A(or decimal 10) d2=1 base=16

d1 * base + d2

10 * 16 + 1= decimal 161

hex 129

d1=1 d2=2 d3=9 base=16

(d1 * base + d2) * base + d3 = (1 * 16 + 2) * 16 + 9 = decimal 297

The same method can be applied to conversion of octal and binary numbers:

binary 1011

d1=1 d2=0 d3=1 d4=1 base=2

((d1 * base + d2) * base + d3) * base + d4 =

((1 * 2 + 0) * 2 + 1) * 2 + 1 = decimal 11

octal 1232

d1=1 d2=2 d3=3 d4=2 base=8

((d1 * base + d2) * base + d3) * base + d4 =

((1 * 8 + 2) * 8 + 3) * 8 + 2 = decimal 666
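The digit-by-digit pattern used in these examples (multiply the running total by the base, then add the next digit) can be written as a short loop. The function name `to_decimal` is made up for this sketch, and it does not check that each digit is valid for the given base.

```python
def to_decimal(digits: str, base: int) -> int:
    """Convert a digit string in the given base (up to 16) to an integer,
    working from the leftmost digit."""
    value = 0
    for ch in digits.lower():
        # Multiply what we have so far by the base, then add the next digit.
        value = value * base + "0123456789abcdef".index(ch)
    return value


print(to_decimal("A1", 16))    # 161
print(to_decimal("129", 16))   # 297
print(to_decimal("1011", 2))   # 11
print(to_decimal("1232", 8))   # 666
```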

Representing signed integers in binary

Binary numbers have no inherent way to represent negative numbers in a computer. To create these "signed integers", a few different systems have been developed. In each, a special bit is set aside as the "sign bit", which is usually the leftmost (most significant) bit. If the sign bit is 1 the number is negative; if 0, positive.

Sign and magnitude

This is the simplest way to depict a negative number: the sign is the most significant bit, and the magnitude is a binary number using the remaining bits. For example, using 4 bits:

 0101 = +5
 1101 = -5

One's complement

In order to "complement" (change the sign of) a binary number, a bitwise NOT operation is performed on the bits of the number. For example:

 0101 = +5
 1010 = -5

A side effect of both this and the previous system is that there are two representations for zero, which is one of the reasons neither system is well suited to computing. In one's complement, for example:

 0000 = +0
 1111 = -0

Two's complement

This is the most widely used system in modern computing. To form the two's complement of a number, take the bitwise NOT of the number and add 1. For example:

 0101 = +5
 1011 = -5

Thus:

   0000 = decimal 0    1000 = decimal -8
   0001 = decimal 1    1001 = decimal -7
   0010 = decimal 2    1010 = decimal -6
   0011 = decimal 3    1011 = decimal -5
   0100 = decimal 4    1100 = decimal -4
   0101 = decimal 5    1101 = decimal -3
   0110 = decimal 6    1110 = decimal -2
   0111 = decimal 7    1111 = decimal -1

Using this system, 16 bits will encode numbers from -32,768 to 32,767, while 32 bits will encode -2,147,483,648 to 2,147,483,647.
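A minimal Python sketch of two's complement encoding and decoding follows. The function names `encode_twos` and `decode_twos` are invented for this illustration. Masking a negative Python integer with 2ⁿ − 1 produces the same bit pattern as "bitwise NOT, then add 1".

```python
def encode_twos(value: int, bits: int) -> str:
    """Return the two's complement bit pattern of value in the given width."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= value <= high:
        raise ValueError(f"{value} does not fit in {bits} bits")
    return format(value & ((1 << bits) - 1), f"0{bits}b")


def decode_twos(pattern: str) -> int:
    """Interpret a bit string as a two's complement signed integer."""
    raw = int(pattern, 2)
    # A set sign bit means the pattern stands for raw - 2**bits.
    if pattern[0] == "1":
        raw -= 1 << len(pattern)
    return raw


print(encode_twos(5, 4))      # 0101
print(encode_twos(-5, 4))     # 1011
print(decode_twos("1000"))    # -8
print(decode_twos("1111"))    # -1
```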

Representing fractions in binary

Fixed-point numbers

Fixed-point formats are often used in business calculations (such as with spreadsheets or COBOL), where floating-point with insufficient precision is unacceptable when dealing with money. It is helpful to study them to see how fractions can be stored in binary.

An arbitrary number of bits must be chosen to store the fractional part of a number, and to store the integer part. For example, using a 32-bit format, 16 bits might be used for the integer and 16 for the fraction.

The fractional bits continue the pattern set by the integer bits: if the eight's bit is followed by the four's bit, then the two's bit, then the one's bit, then of course the next bit is the half's bit, then the quarter's bit, then the 1/8's bit, et cetera.

Examples:

                         Integer bits   Fractional bits
    0.5      =    1/2 = 00000000 00000000.10000000 00000000
    1.25     =  1 1/4 = 00000000 00000001.01000000 00000000
    7.375    =  7 3/8 = 00000000 00000111.01100000 00000000
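The examples above can be reproduced with a short sketch, assuming the 16-bit integer and 16-bit fraction split used in this section. The names `SCALE`, `to_fixed` and `show_fixed` are hypothetical, and only non-negative values are handled.

```python
FRACTION_BITS = 16
SCALE = 1 << FRACTION_BITS  # 65,536: the weight of the one's bit in the stored integer


def to_fixed(x: float) -> int:
    """Return the nearest 16.16 fixed-point value for a non-negative x,
    stored as a plain integer."""
    return round(x * SCALE)


def show_fixed(raw: int) -> str:
    """Format a 16.16 value with a point between integer and fractional bits."""
    bits = format(raw, "032b")
    return f"{bits[0:8]} {bits[8:16]}.{bits[16:24]} {bits[24:32]}"


for x in (0.5, 1.25, 7.375):
    print(x, "=", show_fixed(to_fixed(x)))
```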

However, using this form of encoding means that some numbers cannot be represented exactly in binary. For example, for the fraction 1/5 (in decimal, this is 0.2), the closest one can get is:

   13107/65536 = 00000000 00000000.00110011 00110011 = 0.1999969... in decimal
   13108/65536 = 00000000 00000000.00110011 00110100 = 0.2000122... in decimal

And even with more digits, an exact representation is impossible. Consider the number 1/3. If you were to write the number out as a decimal (0.333333...) it would continue indefinitely. If you were to stop at any point, the number written would not exactly represent the number 1/3.

The point is: some fractions cannot be expressed exactly in binary notation unless you use a special trick. The trick is to store a fraction as two numbers, one for the numerator and one for the denominator, and then use arithmetic to add, subtract, multiply, and divide them. However, such arithmetic will not let you do higher math (such as square roots) with fractions, nor will it help you if the lowest common denominator of two fractions is too big a number to handle. This is why there are advantages to using the fixed-point notation for fractional numbers.
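As an illustration of the numerator-and-denominator trick, Python's standard library provides a `fractions.Fraction` type that works this way. The variable names below are chosen only for this example.

```python
from fractions import Fraction

one_fifth = Fraction(1, 5)
one_third = Fraction(1, 3)

# Each value is kept as an exact numerator/denominator pair.
print(one_fifth + one_third)   # 8/15
print(3 * one_third == 1)      # True

# Compare with the nearest 16-bit binary fraction to 1/5.
print(Fraction(13107, 65536) == one_fifth)  # False
```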

Floating-point numbers

While both unsigned and signed integers are used in digital systems, even a 32-bit integer is not enough to handle all the range of numbers a calculator can handle, and that's not even including fractions. To approximate the greater range and precision of real numbers we have to abandon signed integers and fixed-point numbers and go to a "floating-point" format.

In the decimal system, we are familiar with floating-point numbers of the form:

1.1030402 × 10⁵ = 1.1030402 × 100000 = 110304.02

or, more compactly:

   1.1030402E5

which means "1.1030402 times 1 followed by 5 zeroes". We have a certain numeric value (1.1030402) known as a "significand", multiplied by a power of 10 (E5, meaning 10⁵ or 100,000), known as an "exponent". If we have a negative exponent, that means the number is multiplied by a 1 that many places to the right of the decimal point. For example:

2.3434E-6 = 2.3434 × 10⁻⁶ = 2.3434 × 0.000001 = 0.0000023434

The advantage of this scheme is that by using the exponent we can get a much wider range of numbers, even if the number of digits in the significand, or the "numeric precision", is much smaller than the range. Similar binary floating-point formats can be defined for computers. There are a number of such schemes; the most popular has been defined by the IEEE (Institute of Electrical and Electronics Engineers, a US professional and standards organization). The IEEE 754 standard specification defines a 64-bit floating-point format with:

an 11-bit binary exponent, using "excess-1023" format. Excess-1023 means the exponent appears as an unsigned binary integer from 0 to 2047, and you have to subtract 1023 from it to get the actual signed value
a 52-bit significand, also an unsigned binary number, defining a fractional value with a leading implied "1"
a sign bit, giving the sign of the number.

Let's see what this format looks like by showing how such a number would be stored in 8 bytes of memory:

   byte 0:         S    x10  x9   x8   x7   x6   x5   x4
   byte 1:         x3   x2   x1   x0   m51  m50  m49  m48
   byte 2:         m47  m46  m45  m44  m43  m42  m41  m40
   byte 3:         m39  m38  m37  m36  m35  m34  m33  m32
   byte 4:         m31  m30  m29  m28  m27  m26  m25  m24
   byte 5:         m23  m22  m21  m20  m19  m18  m17  m16
   byte 6:         m15  m14  m13  m12  m11  m10  m9   m8
   byte 7:         m7   m6   m5   m4   m3   m2   m1   m0

where "S" denotes the sign bit, "x" denotes an exponent bit, and "m" denotes a significand bit. Once the bits here have been extracted, they are converted with the computation:

<sign> × (1 + <fractional significand>) × 2^{<exponent> - 1023}

This scheme provides numbers valid out to about 15 decimal digits, with the following range of numbers:

	maximum	minimum
positive	1.797693134862316E+308	4.940656458412465E-324
negative	-4.940656458412465E-324	-1.797693134862316E+308
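The extraction of the sign, exponent, and significand bits from a 64-bit value, and the conversion formula given above, can be sketched with Python's standard `struct` module. The function names `decode_double` and `rebuild` are invented for this illustration, and `rebuild` only covers ordinary ("normal") numbers. It does not handle zero, the very smallest values near the minimum in the table, infinities, or NaNs.

```python
import struct


def decode_double(x: float):
    """Split a 64-bit float into (sign bit, stored exponent, fraction bits)."""
    # Pack as a big-endian double, then reread the same 8 bytes as an integer.
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = raw >> 63
    exponent = (raw >> 52) & 0x7FF          # 11 bits, excess-1023
    fraction = raw & ((1 << 52) - 1)        # 52 bits after the implied leading 1
    return sign, exponent, fraction


def rebuild(sign: int, exponent: int, fraction: int) -> float:
    """Apply sign x (1 + fraction) x 2^(exponent - 1023) for normal numbers."""
    return (-1) ** sign * (1 + fraction / 2 ** 52) * 2.0 ** (exponent - 1023)


print(decode_double(1.0))               # (0, 1023, 0)
print(decode_double(-2.5))              # (1, 1024, 1125899906842624)
print(rebuild(*decode_double(-2.5)))    # -2.5
```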

The spec also defines several special values that are not ordinary numbers, known as NaNs, for "Not A Number". These are used by programs to designate invalid operations and the like. You will rarely encounter them, and NaNs will not be discussed further here.

Some programs also use 32-bit floating-point numbers. The most common scheme uses a 23-bit significand with a sign bit, plus an 8-bit exponent in "excess-127" format, giving about 7 valid decimal digits.

   byte 0:         S    x7   x6   x5   x4   x3   x2   x1   
   byte 1:         x0   m22  m21  m20  m19  m18  m17  m16  
   byte 2:         m15  m14  m13  m12  m11  m10  m9   m8   
   byte 3:         m7   m6   m5   m4   m3   m2   m1   m0

The bits are converted to a numeric value with the computation:

<sign> × (1 + <fractional significand>) × 2^{<exponent> - 127}

leading to the following range of numbers:

	maximum	minimum
positive	3.402823E+38	1.401298E-45
negative	-1.401298E-45	-3.402823E+38

Such floating-point numbers are known as "reals" or "floats" in general, but with a number of inconsistent variations, depending on context:

A 32-bit float value is sometimes called a "real32" or a "single", meaning "single-precision floating-point value".

A 64-bit float is sometimes called a "real64" or a "double", meaning "double-precision floating-point value".

The term "real" without any elaboration generally means a 64-bit value, while the term "float" similarly generally means a 32-bit value.

Once again, remember that bits are bits. If you have 8 bytes stored in computer memory, it might be a 64-bit real, two 32-bit reals, or four 16-bit signed or unsigned integers, or some other kind of data that fits into 8 bytes.

The only difference is how the computer interprets them. If the computer stored four unsigned integers and then read them back from memory as a 64-bit real, it almost always would be a perfectly valid real number, though it would be junk data.
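This reinterpretation can be tried out with Python's `struct` module. The sketch below assumes 16-bit integers and little-endian byte order, both chosen only for illustration; the stored values 1, 2, 3, 4 are arbitrary.

```python
import struct

# Store four 16-bit unsigned integers: 8 bytes in total.
data = struct.pack("<4H", 1, 2, 3, 4)
print(len(data))  # 8

# Read the very same bytes back under different interpretations.
as_double = struct.unpack("<d", data)[0]   # one 64-bit real
as_floats = struct.unpack("<2f", data)     # two 32-bit reals
as_ints = struct.unpack("<4H", data)       # the original integers

print(as_double)   # a valid but meaningless real number
print(as_floats)
print(as_ints)     # (1, 2, 3, 4)
```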

So now our computer can handle positive and negative numbers with fractional parts. However, even with floating-point numbers you run into some of the same problems that you did with integers:

As with integers, you only have a finite range of values to deal with. Granted, it's a much bigger range of values than even a 32-bit integer, but if you keep multiplying numbers you'll eventually get one bigger than the real value can hold and have a "numeric overflow".

If you keep dividing you'll eventually get one with a negative exponent too big for the real value to hold and have a "numeric underflow". Remember that a negative exponent gives the number of places to the right of the decimal point and means a really small number. The maximum real value is sometimes called "machine infinity", since that's the biggest value the computer can wrap its little silicon brain around.

A related problem is that you have only limited "precision" as well. That is, you can only represent about 15 decimal digits with a 64-bit real. If the result of a multiply or a divide has more digits than that, they're just dropped and the computer doesn't inform you of an error.

This means that if you add a very small number to a very large one, the result is just the large one. The small number was too small to even show up in 15 or 16 digits of resolution, and the computer effectively discards it. If you are performing computations and you start getting really insane answers from things that normally work, you may need to check the range of your data. It's possible to "scale" the values to get more accurate results.

It also means that if you do floating-point computations, there's likely to be a small error in the result since some lower digits have been dropped. This effect is unnoticeable in most cases, but if you do some math analysis that requires lots of computations, the errors tend to build up and can throw off the results. People who use computers for numerical work understand these errors very well, and have methods for minimizing the effects of such errors, as well as for estimating how big the errors are.

By the way, this "precision" problem is not the same as the "range" problem at the top of this list. The range issue deals with the maximum size of the exponent, while the precision issue deals with the number of digits that can fit into the significand.
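Both effects are easy to observe in Python, whose `float` type is a 64-bit real on common platforms. The variable names are chosen for this sketch. `math.fsum` is a standard-library function that tracks the dropped digits while summing, and is shown here as one example of a method for limiting accumulated error.

```python
import math

# A small number added to a very large one simply disappears.
big = 1.0e16
small = 1.0
print(big + small == big)       # True

# Small errors accumulate over repeated operations.
tenths = [0.1] * 10
print(sum(tenths) == 1.0)       # False
print(math.fsum(tenths) == 1.0) # True
```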

Another more obscure error that creeps in with floating-point numbers is the fact that the significand is expressed as a binary fraction that doesn't necessarily perfectly match a decimal fraction.

That is, if you want to do a computation on a decimal fraction that is a neat sum of reciprocal powers of two, such as 0.75, the binary number that represents this fraction will be 0.11, or 1/2 + 1/4, and all will be fine. Unfortunately, in many cases you can't get a sum of these "reciprocal powers of 2" that precisely matches a specific decimal fraction, and the results of computations will be very slightly off, way down in the very small parts of a fraction. For example, the decimal fraction "0.1" is equivalent to an infinitely repeating binary fraction: 0.000110011 ...
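The difference between 0.75 and 0.1 can be made visible with Python's standard `decimal` module, which can print the exact value that a 64-bit binary real actually holds. This is a sketch for illustration only.

```python
from decimal import Decimal

# 0.75 is 1/2 + 1/4, so it is stored exactly.
print(Decimal(0.75))   # 0.75

# 0.1 is not a finite sum of reciprocal powers of 2, so the stored
# value is only the nearest available binary fraction.
print(Decimal(0.1))    # 0.1000000000000000055511151231257827021181583404541015625

# The tiny discrepancies show up in ordinary arithmetic.
print(0.1 + 0.2 == 0.3)  # False
```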

If you don't follow all of this, don't worry too much. The point here is that a computer isn't magic, it's a machine and is subject to certain rules and constraints. Although many people place a childlike faith in the answers computers give, even under the best of circumstances these machines have a certain unavoidable inexactness built into their treatment of real numbers.

Numbers in programming languages

Low-level programmers have to worry about unsigned and signed, fixed and floating-point numbers. They have to write wildly different code, with different opcodes and operands, to add two floating point numbers compared to the code to add two integers.

However, high-level programming languages such as LISP and Python offer an abstract number that may be an expanded type such as rational, bignum, or complex. Programmers in LISP or Python (among others) have some assurance that their program code will Do The Right Thing with mathematical operations. Due to operator overloading, mathematical operations on any number, whether signed, unsigned, rational, floating-point, fixed-point, integral, or complex, are written exactly the same way. Other languages such as REXX and Java provide decimal floating-point, which avoids many "unexpected" results.

Text are numbers: ASCII and strings

So now we have several means for using bits to encode numbers. But what about text? How can a computer store names, addresses, letters to your folks?

Well, if you remember that bits are bits, there's no reason that a set of bits can't be used to represent a character like "A" or "?" or "z" or whatever. Since most computers work on data a byte at a time, it is convenient to use a single byte to represent a single character. For example, we could assign the bit pattern:

   0100 0110 (hex 46)

to the letter "F". The computer sends such "character codes" to its display to print the characters that make up the text you see.

There is a standard binary encoding for western text characters, known as the "American Standard Code for Information Interchange" (ASCII).

The ASCII table serves to emphasize one of the main ideas of this document: bits are bits. In this case, you have bits representing characters. You can describe the particular code for a particular character in decimal, octal, or hexadecimal, but it's still the same code. The value that is expressed, whether it is in decimal, octal, or hex, is simply the same pattern of bits.

Of course, you normally want to use many characters at once to display sentences and the like, such as:

Tiger, tiger burning bright!

This of course is simply represented as a sequence of ASCII codes, represented in hex below:

54 69 67 65 72 2c 20 74 69 67 65 72 20 62 75 ...
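The same sequence of codes can be produced with a couple of lines of Python. This is a minimal sketch; the variable names are arbitrary.

```python
text = "Tiger, tiger burning bright!"

# Encode the text as ASCII: one byte per character.
codes = text.encode("ascii")

# Show the first 15 codes in hex, as in the listing above.
print(" ".join(f"{byte:02x}" for byte in codes[:15]))
# 54 69 67 65 72 2c 20 74 69 67 65 72 20 62 75
```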

Computers store such "strings" of ASCII characters as "arrays" of consecutive memory locations. Some applications include a binary value as part of the string to show how many characters are stored in it. More commonly, applications use a special character, usually a NULL (the character with ASCII code 0), as a "terminator" to indicate the end of the string. Most of the time users will not need to worry about these details, as the application takes care of them automatically, though if you are writing programs that manipulate characters and strings you will have to understand how they are implemented.

Now let's consider a particularly confusing issue for the newcomer: the fact that you can represent a number in ASCII as a string, for example:

1.537E3

When a computer displays this value, it actually sends the following ASCII codes, represented in hex, to the display:

31 2e 35 33 37 45 33

The confusion arises because the computer could store the value 1.537E3 as, say, a 32-bit real, in which you get a pattern of 4 bytes that make up the exponent and significand and all that. To display the 32-bit real, the computer has to translate it to the ASCII string just shown above, as an "ASCII numeric representation", or just "ASCII number". If it just displayed the 32-bit binary real number directly, you'd get four "garbage" characters. But, now to get really confusing, suppose you wanted to view the bits of a 32-bit real directly, bypassing conversion to the ASCII string value. Then the computer would display something like:

10110011 10100000 00110110 11011111

The trick is that to display these values, the computer uses the ASCII characters for "1", "0", and " " (space character), with hex values as follows:

31 30 31 31 30 30 31 31 20 31 30 31 30 30 ...

It could also display the bits as an octal or hex ASCII value. We often get queries from users saying they are dealing with "hex numbers". On investigation it usually proves that they are manipulating binary values that are represented by hex numbers in ASCII.
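The three views of the value 1.537E3 discussed here (the ASCII string, the stored 32-bit real, and an ASCII rendering of the stored bits) can be compared in a short Python sketch. Big-endian byte order is assumed purely to make the bits easy to read, and the variable names are invented for this example.

```python
import struct

value = 1.537e3

# What is sent to the display: the ASCII codes of the characters "1.537E3".
ascii_form = "1.537E3".encode("ascii")
print(ascii_form.hex())        # 312e3533374533

# What is stored in memory: 4 bytes holding sign, exponent, and significand.
binary_form = struct.pack(">f", value)
print(binary_form.hex())       # 44c02000

# Viewing the stored bits still means producing ASCII "0", "1", and " ".
bit_string = " ".join(f"{byte:08b}" for byte in binary_form)
print(bit_string)              # 01000100 11000000 00100000 00000000
print(bit_string.encode("ascii")[:9].hex())  # 303130303031303020
```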

Confused? Don't feel too bad, even experienced people get subtly confused with this issue sometimes. The essential point is that the values the computer works on are just sets of bits. For you to actually see the values, you have to get an ASCII representation of them. Or to put it simply: machines work with bits and bytes, humans work with ASCII, and there has to be translation to allow the two to communicate.

8 bits is clearly not enough to allow representation of, say, Japanese characters, since their basic set is a little over 2,000 different characters. As a result, to encode Asian languages such as Japanese or Chinese, computers use a 16-bit code for characters. There are a variety of specs for encoding non-Western characters, the most widely used being "Unicode", which provides character codes for Western, Asian, Indic, Hebrew, and other character sets, including even Egyptian hieroglyphics.
