The Outer Shell of Floating Point
Charles Douglas Wehner
24 September 2008
These notes are for guidance only.
No liability accepted.

When dealing with floating point, so much depends on the standard to which you are working.

In my advanced Forth, I had in the memory of the computer the following:

MMMMEE

..where M is the "Mantissa", and E the "Exponent", the figures are bytes and the system is "little-endian", in that the low bytes are to the left.

Four bytes of Mantissa give a number up to 4294967295. Taking the logarithm to base 10, we have 9.63. So we say that this is a 9.63-place mathematical system.

In my Forth floating-point, numbers up to 4.294967295 were displayed as ten figures, whilst the number 4.294967296 and above were displayed as nine. The same applied to 42.94967295 and 429.4967295 and so on.

The reason for this is that I had overcome a set-theoretical problem of how to allow machine-code to compile and then decompile. Programs written with the floating-point could be made to display themselves to the full accuracy, therefore. So one could program in Forth, decompile and save the result, load and compile it again, and never lose accuracy.

The two-byte Exponent allows numbers up to ten to the power 9863, and down to ten to the power -9864. This is important, because the author also implemented his fractional factorials (fractorials), very closely related to Euler's gamma function but also to the integer factorials. Factorials cover a huge dynamic range, and so this range of Exponent is needed.

A one-byte Exponent will only reach ten to the 38 and ten to the -38.

I produced many mathematical system, including a mathematics pack to 65 places of decimals, as well as converters for IEEE Real 32, 64 and 80.

I will try to impart as much of this knowledge as possible in as few words as possible.

In the binary examples given below, the MMMM group will be shown as "big-endian" followed by the EE group also as "big-endian". This is to fall in line with standard Arabic numerals where 123 is one hundred and twenty three, not three hundred and twenty one. A "shift-left" would create 1230 - multiplication by ten, where a binary shift-left is a doubling. In similar vein, a binary shift-right halves the number, with the loss into "carry" or "borrow" of the least significant bit.

Readout

Starting with a very simple concept, we have a byte FF. Extend it to 00FF.

Copy it. Shift it left twice. Add it on (that gives five). Shift once more left. The result in the top byte is a 9. Print it. Clear the top byte.

Again, copy, shift twice, add, shift. Another 9. Again. Now it is 6. Then 0, then 9, then 3, then 7, then 5 and then forever 0.

99609375 is an eight-decimal number. It came from an eight-bit (binary) number. What is it? It is 255 divided by 256. It is what I call the "fixed leading-point" representation of 255. It would be 0.FF in Hex. or 0.11111111 in Binary. Therefore it is correctly 0.99609375 .

This is the quickest way of printing out numbers in decimal.

In Forth there is a variable known as "BASE". There is no other way to print apart from dividing by "BASE" by (slow) long division, putting the remainder (modulus) in the "PAD", decrementing the pad-pointer and dividing again. Long division is slow, but I could write:

3.141592654 fpi f!
16 BASE !
fpi f.
3.243f6A8 OK>
(16 hex places of Pi are 3.243f6a8885dd0f9, showing a worst-case error of 1 in the final digit of the eight-place printout.)

So if you do NOT want a variable base, it is quite acceptable to write the printout routine as a shift-and-add system.

Bear in mind that the European standard would be 3,141592654 (a COMMA rather than a full-stop or "period").

If the exponent EE is zero, the whole number is zero.

If the numbers were:
00 00 00 00 00 80 (shown little-endian, hex)

then we would have, in binary, the Mantissa and Exponent as follows:

00000000 00000000 00000000 00000000 10000000 00000000
(two big-endian groups, binary)

The Mantissa is ALWAYS left-justified:

10000000 00000000 00000000 00000000

So why is the highest bit zero? Because, as we know it is always set in a non-zero number, we can overwrite it with the "signum" (sign). Zero means that the number is POSITIVE, in most systems.

00 00 00 00 00 80 (little-endian) represents the number 1.0 in my system. I believe that this was also the case with Microsoft Basic-A, and with Locomotive Mallard Basic and many other systems. The difference was that they had FIVE bytes of Mantissa, and just 80-Hex for the Exponent.

So, to distinguish the 1 from the zero, we have set the top Exponent bit.

To print out, we print either, perhaps, a space or a plus-sign to signify a leading 0 in the Mantissa, or a minus-sign to signify a leading 1. Then we force the top Mantissa bit to 1.

To find out how many SHIFTS-LEFT belong in this Mantissa, we deduct the "OFFSET" of 32768 (8000 Hex) for 2-bytes or 128 (80 Hex) for 1-byte from the Exponent. I seem to remember that in IEEE Real-32 the equivalent offset was 127 (7F Hex), but am not sure. Details of Real-32, 64 and 80 are on the Internet, with examples.

One creates a variable (perhaps "DECPLACES") for decimal places. Then, one exchanges BINARY shifts for DECIMAL ones by MULTIPLYING ON THE SPOT.

A routine takes the Mantissa, shifts a copy twice to the RIGHT, and adds it on. This gives 1.25 times the number. We bear in mind that THREE left shifts in Binary will also multiply by 8, and eight times one-and-a-quarter is TEN.

If it overflowed, however, the routine will shift the Mantissa right in order to recapture the overflow bit, and the number of left shift needed will be FOUR.

So, if the Exponent is negative, we use that routine and add the three or four to the Exponent. We then deduct 1 from the "DECPLACES".

If the Exponent is still negative, we recycle. Eventually, the Exponent is no longer negative - and we may have zero, one, two or three bits positive. We shift that number of times into a leading yero byte to obtain the first decimal digit.

Looking at "DECPLACES", we may decide to make the system print in scientific notation "1.23456789E-23", or in leading-zero notation "0.0123456789". Routines for these styles of printing have to be written.

The other MULTIPLYING ON THE SPOT routine is one I invented, and have not heard of before. It is used when the exponent is positive, to divide by ten whilst incrementing DECPLACES.

You take the number, and copy it. The copy is shifted ONCE RIGHT and added. The result is copied and shifted FOUR TIMES RIGHT and added. The result is copied and shifted EIGHT places and added. Then SIXTEEN, THIRTY-TWO or whatever, to the full accuracy of the Mantissa.

This multiplies the number by 0.000110011001100110011001100 by the following stages:

0.00011
0.000110011
0.00011001100110011
0.000110011001100110011001100110011

This shows exponential growth of the number of places of accuracy.

As 0.00011001100110011001100110011 is one TENTH, we have divided on the spot by mutiplying by one tenth.

However, it was really a multiplication-on-the-spot by 1.10011001100110011001100110011

This is 1.6, so we need to divide by 16 by deducting FOUR from the Exponent. If any of the stages had overflowed, however, we would have needed to shift right to catch the carry bit. Then we would have deducted only three from the Exponent.

This is done to the Mantissa whilst the Exponent remains positive, and each time the "DECPLACES" is incremented.

So now we can print either 123.456789 or 1.23456789E2 or 1.23456789E+2 by examining the size of "DECPLACES".

Having evaluated the DECPLACES, and provided routines to handle them, we construct the shift-and-add printout (which gave us 99609375) in such a way as to tie it all together.

This fast system is essential when a giant mathematical system is built, because it would otherwise be unbearably slow. The 65-place mathematics printed out as fast as many 7-place systems.

This print-out also gives us experience of handling floating point. Simulated "packages" of floating point can be fed to it, to gain experience.

A refinement to the print-out routine was the rounding. The Mantissa of 0.4294967295 is 1844674407. A Mantissa that is not zero and does not exceed this will print out to ten places. A Mantissa above this will deliver nine.

Therefore, a small number representing a five in the tenth place was added to Mantissae above this number. This ensured that on the nine-figure numbers rounding-up would take place as often as rounding-down. Here, rounding-up means that a number having, say 3 in the ninth place and 5 or more in the tenth would get a 4 in the ninth place instead. Rounding down means that a number ending in a 3 and a number below 5 would keep the 3.

By careful adjustment, it was possible to ensure that after a single rounding operation, the readout and read-in routines did not cause the accuracy to degrade any further.

The 9.63 place arithmetic was designed for an eight-place readout, as with most pocket calculators. That meant that it had 1.63 places of hidden precision.

The 9/10 place full precision readout was for compiling and decompiling purposes. In short, for the handling of the programs.

Read-in

Now we read in.

Take an integer, such as 9. Add it into the four-byte Mantissa:

00000000 00000000 00000000 00001001

Multiply by shift-and-add (10)

00000000 00000000 00000000 01011010

Add 9:
00000000 00000000 00000000 01100011

multiply by 10 and add 6. Then multiply and add nothing, then 9,3,7 and 5.

00000000 00000000 00000011 11100100 (996)
00000000 00000000 00100110 11101000 (9960)
00000000 00000001 10000101 00011001 (99609)
00000000 00001111 00110010 11111101 (996093)
00000000 10010111 11111101 11101001 (9960937)
00001001 11101111 11101011 00011111 (99609375)

This is four bits short of left-justified:

10011110 11111110 10110001 11110000 (99609375 times 16)

If we had used the number 1 instead of 99609375, we would have needed 31 shifts to left-justify it. So a 1 can be floated by taking OFFSET+31 and decrementing it for each left shift until justified. That leaves just the offset (32768 or 128).

Similarly, the 99609375 starts at OFFSET+31, and loses 4. That makes OFFSET+27. The eight decimal places have been converted into 27 binary places in the Exponent byte.

Counting out the number of decimals that are assimilated after a decimal point or comma was detected allows us to post-process to obtain numbers such as 0.99609375 . This is done by the "on-the-spot" routines.

Given a read-out and a read-in routine, we can proceed to create a four-function calculator.

Add

We have to bear in mind that one or other number may be negative. This means that addition contains a hidden subtraction routine.

Look at the two "Signa" (signs), which are usually the leading bits of the Mantissae. XOR one against the other. Is the result a zero? Then use the addition to be described.

Add by Addition

Save the signum. Set both Mantissa-leading-bits. Compare the Exponents. Which is smaller, and by how much?

Shift the smaller Mantissa right by that number of places. Add it to the other Mantissa. Did it carry? If so, shift the result right to rescue the bit, and increment the Exponent of the larger number.

Put the saved signum in place of the top bit of the result Mantissa. The larger Exponent and the Mantissa are now the result.

Exponent overflow must be trapped by a routine.

Add by Subtraction

The signa when XORed against each other did not give zero.

Save the signum of ONE. Set both Mantissa-leading-bits. Compare the Exponents. Which is smaller, and by how much?

If the signum saved is of the smaller, reverse it.

Shift the smaller Mantissa right by that number of places. Subtract it from the other Mantissa. Was it zero? If so, jump to a subroutine that delivers FP-ZERO (usually all bytes zero).

Is the larger Mantissa still left justified? Then transfer the signum and deliver it with its Exponent as the result.

If not left justified, keep shifting it left until it is, decrementing the Exponent and checking for Exponent-underflow (tiny numbers jump to FP-ZERO).

Subtract

Reverse the signum of the subtractor, and jump to the ADD routine.

Multiply

Xor the Signa for the Signum of the result.

Set the leading bit of both Mantissae.

NOTE THAT IN FORTH (THE ONLY SCIENTIFIC STUDY OF SUBROUTINES) THE MAIN MULTIPLY IS UM* ("you-em-star", no offence meant). THIS TAKES TWO SINGLE CELLS AND CREATES A DOUBLE CELL. SIMILARLY, ON THE PENTIUM, IF EAX IS A 32-BIT CELL, AND SO IS EBX, RESULT EDX:EAX OF MUL EBX IS A DOUBLE CELL.

"LEADING" MULTIPLY, FOR FLOATING POINT, TRIES TO LEFT-JUSTIFY EDX:EAX BEFORE DROPPING EAX. "TRAILING" MULTIPLY, FOR INTEGERS, LOOKS AT EDX FOR OVERFLOW AND PRESERVES ONLY EAX.

Multiply Mantissae together. If EDX is left-justified, simply overwrite the leading bit with the final Signum.

If EDX is not left-justified, there will ONLY be a single leading zero. Shift left or add the double-register to itself. Decrement one Exponent. Overwrite the top bit with the Signum.

Add together the Exponents, and subtract one offset. Is the Exponent in range, or is it too big or too small? If too small, jump to FP-ZERO. If too big, use either a routine to report "FP OVERFLOW", or reserve just one code - such as
11111111 11111111 11111111 11111111 11111111 11111111
to represent excessive numbers.

Divide

Division is carried out on simple processors by repeated shifting and subtraction. If there is no borrow, a bit in the quotient is set. If there is a borrow, the divisor is added back to the dividend, and the quotient bit is reset.

However, there is a more advanced technique discovered by the author, and possibly by others, in which the add-back described above is not needed. This makes the process faster. It will be described here.

But first:
Load the Exponent of the dividend into a register and subtract the offset. My own preference for an offset of 128 or 32768 shows one advantage - one can simply reverse the top bit instead.

Load the Exponent of the divisor, and check it for zero. Jump to a "division by zero" routine (error halt or whatever) if zero.

My own preference for an offset of 128 or 32768 shows now. We subtract the (larger) divisor-Exponent from the (smaller) dividend-Exponent. This automatically creates the correct quotient-Exponent and a carry. If the carry does not occur, there has been an overflow.

Xor the Signa for the Signum of the result.

Set the leading bit of both Mantissae.

Subtract the Exponent of the divisor from the Exponent of the dividend. The offsets will cancel each other, so it wastes time first to subtract offsets and then subtract.

Then:
We will be creating TWO routines. One will divide by subtraction, the other by addition.

Consider the number 10011110 11111110 10110001 11110000 (99609375 times 16)

This is put into a double register as dividend:

10011110 11111110 10110001 11110000 00000000 00000000 00000000 00000000

Let us assume that the divisor has a Mantissa of one and a quarter:

10100000 00000000 00000000 00000000

We subtract the divisor from the dividend:

11111110 11111110 10110001 11110000 (borrow)

now we know that we must add back that one-and-a-quarter. However, we can do this by adding back only half, then a quarter, then an eighth and so on until the routine ends.

The borrow made the program jump to the "divide by addition" loop at a point where the carry flag is reset and the entire double register is shifted left (doubled) with carry (borrow):

11111101 11111101 01100011 11100000 00000000 00000000 00000000 00000000

The loop counter is counted down from 32 to 31 because there must be 32 left shifts in total.

Now the one-and-a-quarter is added back to the doubled dividend. This is the same as adding half of that divisor to that dividend:

10011101 11111101 01100011 11100000 00000000 00000000 00000000 00000000

It now carries. So we must jump to the "divide by subtraction" routine at a point where the carry flag is set and the entire double register is shifted left (doubled) with carry (borrow):

00111011 11111010 11000111 11000000 00000000 00000000 00000000 00000001

We can see that at the next subtraction it will again spring to the "divide by addition" loop:

10011011 11111010 11000111 11000000 00000000 00000000 00000000 00000001
It borrowed.

00110111 11110101 10001111 10000000 00000000 00000000 00000000 00000010

What emerges is that the remainder is formed in the leftmost four bytes whilst the quotient is formed in the rightmost four bytes.

However, if the 32-times counter counts to nil whilst the program counter is in the "divide by addition" loop, there will be a fragment of remainder missing. So it must leap out of the "divide by addition" loop via an "add divisor to dividend" instruction.

To understand that, imagine that we owed a number and paid back a HALF, then a QUARTER, then an EIGHTH of it. If we pay back another EIGHTH we have paid it all back. If it had been four stages, we would be paying back a SIXTEENTH. So for any number of stages we are always paying back in full.

If the countdown ends in the "divide by subtraction" loop, it just leaps out without that final addition.

If only the quotient is needed, as in the creation of the mantissa of the floating-point quotient, we do not bother to have the final addition after the "divide by addition" loop. This saves a couple of clock cycles.

It will be seen that if the carry (borrow) is not set by the loops after addition or subtraction (whichever it may be), the topmost bit is always zero so that the shifting of the quotient is delivering a string of zeroes into the carry.

However, when the carry (borrow) is set by the addition or subtraction, the leading bit is set. However, the "divide by addition" is the complement of the "divide by subtraction" routine and vice-versa. Thus, after the leap to the complementary loop, the discarding of the 1 into the carry is equivalent to discarding a zero into the carry.

At this point, we check the top bit of the quotient we have generated. If it is reset, we shift the quotient left (double it) and decrement the quotient-Exponent. If that Exponent goes to zero, the number is too small, so we jump to a routine that creates FP-ZERO. In this case it needs only to zero the Mantissa.

The lowest Mantissa is 10000000 00000000 00000000 00000000, which represents unity. the highest is 11111111 11111111 11111111 11111111, which is just short of 2. Division of the lowest by the highest cannot therefore give a number below a half. Accordingly, the top bit alone might be reset, but two or more bits will not be. We need to check the top bit only once therefore.

Scientific Functions

When the above has been programmed, one can take the functions described under "The Inner Core of Floating Point" and add them into the package.

That is how the author's many mathematical systems were developed.

"The Inner Core of Floating Point" focuses mainly on the finer details of transcendental functions - the means to achieve rapid "convergence" of a "polynomial". However, both the Exponent and Mantissa are described.

The details given above describe not only the basic four functions, read in and print out, but also the means of "packing" the floating point and means of achieving speed.

When the system described has been written, the high-speed subroutines will be available to the transcendental functions as needed.

Logic Functions including Fractional

The package had a number of original floating-point logic functions. To make these functions easy to program, there were programs to separate the integer part of a number from its fractional part.

"Fint" - floating integer - might be used to find the integer part of pi:

11001001 00001111 11011010 10100000 10000000 00000010

We see from the Exponent on the right (big-endian) that it has an offset of 32768 and a "shift left" of two. So we make a mask for two bits:

11000000 00000000 00000000 00000000

This is combined by means of the AND function with the original Mantissa.

So there is your three:
11000000 00000000 00000000 00000000 10000000 00000010

"Ffrac" - floating fraction - is made using the complementary mask:

00111111 11111111 11111111 11111111

When the AND function is used, the zero flag will be set if the fractional part is zero, allowing a leap to FP-ZERO. With pi this does not happen:

00001001 00001111 11011010 10100000 10000000 00000010

This is not left-justified, so we need to shift the Mantissa left until it is. It requires four shifts:

10010000 11111101 10101010 00000000 011111111 11111110

Shifting left doubles the number, so for each shift we decremented the Exponent. This left an exponent of 32766, which is below 32768 and therefore represents a fraction.

This number is 0.141592653

Before floating-point logic functions are implemented, the lesser argument must be shifted right and its Exponent incremented until the two Exponents match.

"Fand" - floating-point AND - was made by adding a 2 to the Mantissa of each number. That is for rounding. If it overflows, the carry is shifted back, the other argument also shifted right and the Exponent incremented. Both numbers are then masked to clear the bottom two bits. Then the CPU's AND function is applied.

The result would be that if we write

9.875 5.75 fand f.

the Forth system would give

1.75 OK>

However, if we write

9.875 5.75 fand fint f.

we get

1.0 OK>

That is the same as conventional integer-logic AND.

And if we write

9.875 5.75 fand ffrac f.

we get

0.75 OK>

All the usual logic functions, OR, EXOR and so on, were written in the way described, giving fractional logic to 30 binary places.

The loss of two bits was essential to prevent the system from seeming to have gone wrong. For example, 0.9999999999 would be rounded up by the printout routine. It would look like "1.0".

"1.0 1.0 fand f." should give "1.0".

However, if the first "1.0" is secretly 0.9999999999 and the other a genuine 1.0, a wrong result would emerge. So all numbers were rounded for safety.

Many other novelties were created, including a search for rational numbers.

Consider the number 7.33333333. Is it rational?

We take a copy, and use Ffrac. This delivers 0.33333333. We find the reciprocal, which is 3. We multiply the original by a copy of the 3.

This delivers 22 and 3. So 7.33333333 is 22 divided by 3.

This technique was implemented on the giant laboratory mathematical system of 224 bits (67.43 decimal places displaying as 65 with 2.43 hidden).

The reason for the absurd precision can be seen. With a mathematical system of this enormity, it would be very rare that a number is mistaken for an integer when it is not.

Rounding

Only the printout, fractional logic routines and rational-number test used rounding. The four functions, add, subtract, divide and multiply as well as the transcendental functions, were implemented without rounding.

When printing out to eight decimal places, the number 21 was added on to encouraqe rounding to eight places. A smaller number, 2, was used when rounding to nine.

30 Bits

A mask was used to clear the bottom two bits after rounding for certain applications, such as a search for rational numbers. In the example given above, we had 0.333333333

The reciprocal might be 3.00000001

Clearing the bottom bits sets it to 3. Accuracy is still correct to one part in 1,073,741,824.

Efficiency

All routines and subroutines were written by reference to the number of clock cycles taken by each instruction. For example, if a binary shift left is slower than an addition like ADD EAX,EAX, we use the latter.

Each instruction was examined to see if it could do several things at once. For example. if one subtracts the offset one can already use two conditional jumps. JZ (jump if zero) would jump, perhaps, to a printout routine that uses no shifts, bypassing the multiplication-on-the-spot. JC (jump if carry) would jump to a routine that prints numbers below unity (shifting right). No jump would lead to a routine that prints numbers by shifting left.

Most numbers would print by shifting left or right. Therefore, for preference one uses the JC before the JZ to obtain a trace more long-term efficiency.

Advanced Central Processors

Of course, the high-speed division need not have been described, if one is programming for a central processor like a Pentium. Such devices have built-in mathematics hardware, just as earlier devices had arithmetic co-processors.

However, for total mastery of floating-point it is essential to know how to write a multiplication and division routine. Most pocket calculators have very simple processors inside them, so that all functions are in software. Competence across the spectrum leads to confidence when faced with any unfamiliar processor.

When writing for a Pentium one would, of course, simply use the MUL and DIV instructions. Writing software to replace fast hardware which exists simply wastes the power of the machine.


The lifelong possessions of the author were stolen by the British government.
These notes were prepared to preserve whatever could be preserved, and to encourage its dissemination.

Further reading:
http://wehner.org/fpoint
http://wehner.org/euler

HOME

(C) 1983-2008 Charles Douglas Wehner.
Use freely but do not plagiarise.