|
Charles Douglas Wehner 24 September 2008
When dealing with floating point, so much depends on the standard to which you are working. In my advanced Forth, I had in the memory of the computer the following: MMMMEE ..where M is the "Mantissa", and E the "Exponent", the figures are bytes and the system is "little-endian", in that the low bytes are to the left. Four bytes of Mantissa give a number up to 4294967295. Taking the logarithm to base 10, we have 9.63. So we say that this is a 9.63-place mathematical system. In my Forth floating-point, numbers up to 4.294967295 were displayed as ten figures, whilst the number 4.294967296 and above were displayed as nine. The same applied to 42.94967295 and 429.4967295 and so on. The reason for this is that I had overcome a set-theoretical problem of how to allow machine-code to compile and then decompile. Programs written with the floating-point could be made to display themselves to the full accuracy, therefore. So one could program in Forth, decompile and save the result, load and compile it again, and never lose accuracy. The two-byte Exponent allows numbers up to ten to the power 9863, and down to ten to the power -9864. This is important, because the author also implemented his fractional factorials (fractorials), very closely related to Euler's gamma function but also to the integer factorials. Factorials cover a huge dynamic range, and so this range of Exponent is needed. A one-byte Exponent will only reach ten to the 38 and ten to the -38. I produced many mathematical system, including a mathematics pack to 65 places of decimals, as well as converters for IEEE Real 32, 64 and 80. I will try to impart as much of this knowledge as possible in as few words as possible. In the binary examples given below, the MMMM group will be shown as "big-endian" followed by the EE group also as "big-endian". This is to fall in line with standard Arabic numerals where 123 is one hundred and twenty three, not three hundred and twenty one. A "shift-left" would create 1230 - multiplication by ten, where a binary shift-left is a doubling. In similar vein, a binary shift-right halves the number, with the loss into "carry" or "borrow" of the least significant bit.
Readout
Copy it. Shift it left twice. Add it on (that gives five). Shift once
more left. The result in the top byte is a 9. Print it. Clear the top byte.
Again, copy, shift twice, add, shift. Another 9. Again. Now it is 6.
Then 0, then 9, then 3, then 7, then 5 and then forever 0.
99609375 is an eight-decimal number. It came from an eight-bit (binary) number.
What is it? It is 255 divided by 256. It is what I call the "fixed
leading-point" representation of 255. It would be 0.FF in Hex. or
0.11111111 in Binary. Therefore it is correctly 0.99609375 .
This is the quickest way of printing out numbers in decimal.
In Forth there is a variable known as "BASE". There is no other way to
print apart from dividing by "BASE" by (slow) long division, putting the
remainder (modulus) in the "PAD", decrementing the pad-pointer and dividing
again. Long division is slow, but I could write:
So if you do NOT want a variable base, it is quite acceptable to write
the printout routine as a shift-and-add system.
Bear in mind that the European standard would be 3,141592654 (a COMMA
rather than a full-stop or "period").
If the exponent EE is zero, the whole number is zero.
If the numbers were:
then we would have, in binary, the Mantissa and Exponent as follows:
00000000 00000000 00000000 00000000 10000000 00000000
The Mantissa is ALWAYS left-justified:
10000000 00000000 00000000 00000000
So why is the highest bit zero? Because, as we know it is always set
in a non-zero number, we can overwrite it with the "signum" (sign).
Zero means that the number is POSITIVE, in most systems.
00 00 00 00 00 80
(little-endian) represents the number 1.0 in my system. I believe
that this was also the case with Microsoft Basic-A, and with Locomotive
Mallard Basic and many other systems. The difference was that they had
FIVE bytes of Mantissa, and just 80-Hex for the Exponent.
So, to distinguish the 1 from the zero, we have set the top Exponent bit.
To print out, we print either, perhaps, a space or a plus-sign to signify
a leading 0 in the Mantissa, or a minus-sign to signify a leading 1. Then we
force the top Mantissa bit to 1.
To find out how many SHIFTS-LEFT belong in this Mantissa, we deduct
the "OFFSET" of 32768 (8000 Hex) for 2-bytes or 128 (80 Hex) for 1-byte
from the Exponent. I seem to remember that in IEEE Real-32 the equivalent
offset was 127 (7F Hex), but am not sure. Details of Real-32, 64 and 80
are on the Internet, with examples.
One creates a variable (perhaps "DECPLACES") for decimal places. Then,
one exchanges BINARY shifts for DECIMAL ones by MULTIPLYING ON THE SPOT.
A routine takes the Mantissa, shifts a copy twice to the RIGHT, and adds
it on. This gives 1.25 times the number. We bear in mind that THREE left
shifts in Binary will also multiply by 8, and eight times one-and-a-quarter
is TEN.
If it overflowed, however, the routine will shift the Mantissa right in
order to recapture the overflow bit, and the number of left shift needed
will be FOUR.
So, if the Exponent is negative, we use that routine and add the three or
four to the Exponent. We then deduct 1 from the "DECPLACES".
If the Exponent is still negative, we recycle. Eventually, the Exponent is
no longer negative - and we may have zero, one, two or three bits positive.
We shift that number of times into a leading yero byte to obtain the first
decimal digit.
Looking at "DECPLACES", we may decide to make the system print in scientific
notation "1.23456789E-23", or in leading-zero notation "0.0123456789".
Routines for these styles of printing have to be written.
The other MULTIPLYING ON THE SPOT routine is one I invented, and have not
heard of before. It is used when the exponent is positive, to divide by ten
whilst incrementing DECPLACES.
You take the number, and copy it. The copy is shifted ONCE RIGHT and added.
The result is copied and shifted FOUR TIMES RIGHT and added. The result is
copied and shifted EIGHT places and added. Then SIXTEEN, THIRTY-TWO or
whatever, to the full accuracy of the Mantissa.
This multiplies the number by 0.000110011001100110011001100
by the following stages:
0.00011
This shows exponential growth of the number of places of accuracy.
As 0.00011001100110011001100110011 is one TENTH, we have divided
on the spot by mutiplying by one tenth.
However, it was really a multiplication-on-the-spot by
1.10011001100110011001100110011
This is 1.6, so we need to divide by 16 by deducting FOUR from the Exponent.
If any of the stages had overflowed, however, we would have needed to
shift right to catch the carry bit. Then we would have deducted only
three from the Exponent.
This is done to the Mantissa whilst the Exponent remains positive,
and each time the "DECPLACES" is incremented.
So now we can print either 123.456789 or 1.23456789E2 or 1.23456789E+2
by examining the size of "DECPLACES".
Having evaluated the DECPLACES, and provided routines to handle them,
we construct the shift-and-add printout (which gave us 99609375) in
such a way as to tie it all together.
This fast system is essential when a giant mathematical system is built,
because it would otherwise be unbearably slow. The 65-place mathematics
printed out as fast as many 7-place systems.
This print-out also gives us experience of handling floating point.
Simulated "packages" of floating point can be fed to it, to gain experience.
A refinement to the print-out routine was the rounding. The Mantissa of
0.4294967295 is 1844674407. A Mantissa that is not zero and does not exceed
this will print out to ten places. A Mantissa above this will deliver nine.
Therefore, a small number representing a five in the tenth place was added
to Mantissae above this number. This ensured that on the nine-figure numbers
rounding-up would take place as often as rounding-down. Here, rounding-up
means that a number having, say 3 in the ninth place and 5 or more in the
tenth would get a 4 in the ninth place instead. Rounding down means that
a number ending in a 3 and a number below 5 would keep the 3.
By careful adjustment, it was possible to ensure that after a single
rounding operation, the readout and read-in routines did not cause the
accuracy to degrade any further.
The 9.63 place arithmetic was designed for an eight-place readout,
as with most pocket calculators. That meant that it had 1.63 places of
hidden precision.
The 9/10 place full precision readout was for compiling and decompiling
purposes. In short, for the handling of the programs.
Read-in
Take an integer, such as 9. Add it into the four-byte Mantissa:
00000000 00000000 00000000 00001001
Multiply by shift-and-add (10)
00000000 00000000 00000000 01011010
Add 9:
multiply by 10 and add 6. Then multiply and add nothing, then 9,3,7 and 5.
00000000 00000000 00000011 11100100 (996)
This is four bits short of left-justified:
10011110 11111110 10110001 11110000 (99609375 times 16)
If we had used the number 1 instead of 99609375, we would have needed
31 shifts to left-justify it. So a 1 can be floated by taking OFFSET+31
and decrementing it for each left shift until justified. That leaves
just the offset (32768 or 128).
Similarly, the 99609375 starts at OFFSET+31, and loses 4. That makes
OFFSET+27. The eight decimal places have been converted into 27 binary
places in the Exponent byte.
Counting out the number of decimals that are assimilated after a decimal
point or comma was detected allows us to post-process to obtain numbers
such as 0.99609375 . This is done by the "on-the-spot" routines.
Given a read-out and a read-in routine, we can proceed to create a
four-function calculator.
Add
Look at the two "Signa" (signs), which are usually the leading bits
of the Mantissae. XOR one against the other. Is the result a zero?
Then use the addition to be described.
Add by Addition
Shift the smaller Mantissa right by that number of places. Add it to
the other Mantissa. Did it carry? If so, shift the result right to
rescue the bit, and increment the Exponent of the larger number.
Put the saved signum in place of the top bit of the result Mantissa.
The larger Exponent and the Mantissa are now the result.
Exponent overflow must be trapped by a routine.
Add by Subtraction
Save the signum of ONE. Set both Mantissa-leading-bits. Compare the Exponents.
Which is smaller, and by how much?
If the signum saved is of the smaller, reverse it.
Shift the smaller Mantissa right by that number of places. Subtract it from
the other Mantissa. Was it zero? If so, jump to a subroutine that delivers
FP-ZERO (usually all bytes zero).
Is the larger Mantissa still left justified? Then transfer the signum and
deliver it with its Exponent as the result.
If not left justified, keep shifting it left until it is, decrementing the
Exponent and checking for Exponent-underflow (tiny numbers jump to FP-ZERO).
Subtract
Multiply
Set the leading bit of both Mantissae.
NOTE THAT IN FORTH (THE ONLY SCIENTIFIC STUDY OF SUBROUTINES) THE MAIN
MULTIPLY IS UM* ("you-em-star", no offence meant). THIS TAKES TWO SINGLE
CELLS AND CREATES A DOUBLE CELL. SIMILARLY, ON THE PENTIUM, IF EAX IS A
32-BIT CELL, AND SO IS EBX, RESULT EDX:EAX OF MUL EBX IS A DOUBLE CELL.
"LEADING" MULTIPLY, FOR FLOATING POINT, TRIES TO LEFT-JUSTIFY EDX:EAX BEFORE
DROPPING EAX. "TRAILING" MULTIPLY, FOR INTEGERS, LOOKS AT EDX FOR OVERFLOW
AND PRESERVES ONLY EAX.
Multiply Mantissae together. If EDX is left-justified, simply overwrite the
leading bit with the final Signum.
If EDX is not left-justified, there will ONLY be a single leading zero.
Shift left or add the double-register to itself. Decrement one Exponent.
Overwrite the top bit with the Signum.
Add together the Exponents, and subtract one offset. Is the Exponent in
range, or is it too big or too small? If too small, jump to FP-ZERO. If
too big, use either a routine to report "FP OVERFLOW", or reserve just
one code - such as
Divide
However, there is a more advanced technique discovered by the author, and
possibly by others, in which the add-back described above is not needed.
This makes the process faster. It will be described here.
But first:
Load the Exponent of the divisor, and check it for zero. Jump to a "division
by zero" routine (error halt or whatever) if zero.
My own preference for an offset of 128 or 32768 shows now. We subtract the
(larger) divisor-Exponent from the (smaller) dividend-Exponent. This
automatically creates the correct quotient-Exponent and a carry. If the
carry does not occur, there has been an overflow.
Xor the Signa for the Signum of the result.
Set the leading bit of both Mantissae.
Subtract the Exponent of the divisor from the Exponent of the dividend.
The offsets will cancel each other, so it wastes time first to subtract
offsets and then subtract.
Then:
Consider the number 10011110 11111110 10110001 11110000
(99609375 times 16)
This is put into a double register as dividend:
10011110 11111110 10110001 11110000 00000000 00000000 00000000 00000000
Let us assume that the divisor has a Mantissa of one and a quarter:
10100000 00000000 00000000 00000000
We subtract the divisor from the dividend:
11111110 11111110 10110001 11110000 (borrow)
now we know that we must add back that one-and-a-quarter. However, we
can do this by adding back only half, then a quarter, then an eighth
and so on until the routine ends.
The borrow made the program jump to the "divide by addition" loop at
a point where the carry flag is reset and the entire double register is
shifted left (doubled) with carry (borrow):
11111101 11111101 01100011 11100000 00000000 00000000 00000000 00000000
The loop counter is counted down from 32 to 31 because there must be 32
left shifts in total.
Now the one-and-a-quarter is added back to the doubled dividend. This is
the same as adding half of that divisor to that dividend:
10011101 11111101 01100011 11100000 00000000 00000000 00000000 00000000
It now carries. So we must jump to the "divide by subtraction" routine
at a point where the carry flag is set and the entire double register is
shifted left (doubled) with carry (borrow):
00111011 11111010 11000111 11000000 00000000 00000000 00000000 00000001
We can see that at the next subtraction it will again spring to the
"divide by addition" loop:
10011011 11111010 11000111 11000000 00000000 00000000 00000000 00000001
00110111 11110101 10001111 10000000 00000000 00000000 00000000 00000010
What emerges is that the remainder is formed in the leftmost four bytes
whilst the quotient is formed in the rightmost four bytes.
However, if the 32-times counter counts to nil whilst the program
counter is in the "divide by addition" loop, there will be a fragment
of remainder missing. So it must leap out of the "divide by addition"
loop via an "add divisor to dividend" instruction.
To understand that, imagine that we owed a number and paid back a HALF,
then a QUARTER, then an EIGHTH of it. If we pay back another EIGHTH we have
paid it all back. If it had been four stages, we would be paying back
a SIXTEENTH. So for any number of stages we are always paying back in
full.
If the countdown ends in the "divide by subtraction" loop, it just
leaps out without that final addition.
If only the quotient is needed, as in the creation of the mantissa of
the floating-point quotient, we do not bother to have the final addition
after the "divide by addition" loop. This saves a couple of clock cycles.
It will be seen that if the carry (borrow) is not set by the loops after
addition or subtraction (whichever it may be), the topmost bit is always
zero so that the shifting of the quotient is delivering a string of zeroes
into the carry.
However, when the carry (borrow) is set by the addition or subtraction,
the leading bit is set. However, the "divide by addition" is the complement
of the "divide by subtraction" routine and vice-versa. Thus, after the
leap to the complementary loop, the discarding of the 1 into the carry is
equivalent to discarding a zero into the carry.
At this point, we check the top bit of the quotient we have generated.
If it is reset, we shift the quotient left (double it) and decrement the
quotient-Exponent. If that Exponent goes to zero, the number is too small,
so we jump to a routine that creates FP-ZERO. In this case it needs only
to zero the Mantissa.
The lowest Mantissa is
10000000 00000000 00000000 00000000,
which represents unity. the highest is
11111111 11111111 11111111 11111111,
which is just short of 2. Division of the lowest by the highest cannot
therefore give a number below a half. Accordingly, the top bit alone
might be reset, but two or more bits will not be. We need to check the
top bit only once therefore.
Scientific Functions
That is how the author's many mathematical systems were developed.
"The Inner Core of Floating Point" focuses mainly on the finer details of
transcendental functions - the means to achieve rapid "convergence" of
a "polynomial". However, both the Exponent and Mantissa are described.
The details given above describe not only the basic four functions, read in
and print out, but also the means of "packing" the floating point and means
of achieving speed.
When the system described has been written, the high-speed subroutines will
be available to the transcendental functions as needed.
Logic Functions including Fractional
"Fint" - floating integer - might be used to find the integer part of pi:
11001001 00001111 11011010 10100000 10000000 00000010
We see from the Exponent on the right (big-endian) that it has an offset
of 32768 and a "shift left" of two. So we make a mask for two bits:
11000000 00000000 00000000 00000000
This is combined by means of the AND function with the original Mantissa.
So there is your three:
"Ffrac" - floating fraction - is made using the complementary mask:
00111111 11111111 11111111 11111111
When the AND function is used, the zero flag will be set if the fractional
part is zero, allowing a leap to FP-ZERO. With pi this does not happen:
00001001 00001111 11011010 10100000
10000000 00000010
This is not left-justified, so we need to shift the Mantissa left until
it is. It requires four shifts:
10010000 11111101 10101010 00000000
011111111 11111110
Shifting left doubles the number, so for each shift we decremented the
Exponent. This left an exponent of 32766, which is below 32768 and therefore
represents a fraction.
This number is 0.141592653
Before floating-point logic functions are implemented, the lesser argument must
be shifted right and its Exponent incremented until the two Exponents match.
"Fand" - floating-point AND - was made by adding a 2 to the Mantissa of each
number. That is for rounding. If it overflows, the carry is shifted back, the
other argument also shifted right and the Exponent incremented. Both numbers are
then masked to clear the bottom two bits. Then the CPU's AND function is applied.
The result would be that if we write
the Forth system would give
1.75 OK>
However, if we write
we get
1.0 OK>
That is the same as conventional integer-logic AND.
And if we write
we get
0.75 OK>
All the usual logic functions, OR, EXOR and so on, were written in the
way described, giving fractional logic to 30 binary places.
The loss of two bits was essential to prevent the system from seeming
to have gone wrong. For example, 0.9999999999 would be rounded up by
the printout routine. It would look like "1.0".
"1.0 1.0 fand f." should give
"1.0".
However, if the first "1.0" is secretly 0.9999999999 and the other a genuine
1.0, a wrong result would emerge. So all numbers were rounded for safety.
Many other novelties were created, including a search for rational numbers.
Consider the number 7.33333333. Is it rational?
We take a copy, and use Ffrac. This delivers 0.33333333. We find the
reciprocal, which is 3. We multiply the original by a copy of the 3.
This delivers 22 and 3. So 7.33333333 is 22 divided by 3.
This technique was implemented on the giant laboratory mathematical
system of 224 bits (67.43 decimal places displaying as 65 with 2.43 hidden).
The reason for the absurd precision can be seen. With a mathematical system
of this enormity, it would be very rare that a number is mistaken for an
integer when it is not.
Rounding
When printing out to eight decimal places, the number 21 was added on to encouraqe
rounding to eight places. A smaller number, 2, was used when rounding to nine.
30 Bits
The reciprocal might be 3.00000001
Clearing the bottom bits sets it to 3. Accuracy is still correct to one
part in 1,073,741,824.
Efficiency
Each instruction was examined to see if it could do several things at
once. For example. if one subtracts the offset one can already use two
conditional jumps. JZ (jump if zero) would jump, perhaps, to a printout
routine that uses no shifts, bypassing the multiplication-on-the-spot.
JC (jump if carry) would jump to a routine that prints numbers below unity
(shifting right). No jump would lead to a routine that prints numbers by
shifting left.
Most numbers would print by shifting left or right. Therefore, for
preference one uses the JC before the JZ to obtain a trace more long-term
efficiency.
Advanced Central Processors
However, for total mastery of floating-point it is essential to know how
to write a multiplication and division routine. Most pocket calculators
have very simple processors inside them, so that all functions are in
software. Competence across the spectrum leads to confidence when faced
with any unfamiliar processor.
When writing for a Pentium one would, of course, simply use the MUL and
DIV instructions. Writing software to replace fast hardware which exists
simply wastes the power of the machine.
|
(C) 1983-2008 Charles Douglas Wehner. |