Saturday, October 8, 2011

How to represent a floating-point number in binary mode?

The answer to the OP's question can be found in the C
standard (to which the C++ standard delegates) and
rests on the following *symmetric* floating-point
model, described in (C99) 5.2.4.2.2:
x = s b^e Sum[f_k b^(-k), {k, 1, p}]
where
x: Any floating-point number
s: Sign (+1 or -1)
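b: Base/radix of the exponent representation (an integer > 1)
e: Exponent, where e_min <= e <= e_max
p: Precision (the number of base-b digits in the significand)
f_k: Nonnegative integers less than b (the significand digits)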
The formula uses Mathematica notation as described in
http://documents.wolfram.com/mathematica/functions/Sum
where _ and ^ denote a subscript and a superscript,
respectively.
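
As a rough illustration of the model, here is a minimal C
sketch. It assumes FLT_RADIX == 2, which holds on most
current platforms but is not required by the standard. The
helper names model_digits and model_value are made up for
this example and belong to no library; frexp and scalbn are
the standard <math.h> functions.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Hypothetical helper: split a finite, nonzero double into the model's
   sign s, exponent e and digits f_1..f_p (stored in f[0..p-1]),
   with p == DBL_MANT_DIG. Assumes FLT_RADIX == 2. */
void model_digits(double x, int *s, int *e, int f[DBL_MANT_DIG])
{
    assert(FLT_RADIX == 2);
    assert(isfinite(x) && x != 0.0);

    *s = signbit(x) ? -1 : 1;

    /* frexp returns m in [0.5, 1) with |x| == m * 2^e, which is the
       normalized form of the model for b == 2. */
    double m = frexp(fabs(x), e);

    for (int k = 0; k < DBL_MANT_DIG; ++k) {
        m *= 2.0;          /* shift the next digit in front of the point */
        f[k] = (int)m;     /* that digit is 0 or 1 */
        m -= f[k];         /* keep only the fractional part */
    }
}

/* Hypothetical helper: evaluate x = s * b^e * Sum[f_k b^(-k), {k, 1, p}]
   for b == FLT_RADIX. */
double model_value(int s, int e, const int *f, int p)
{
    assert(s == 1 || s == -1);
    assert(p >= 1);

    double sum = 0.0;
    double weight = 1.0;
    for (int k = 1; k <= p; ++k) {
        assert(f[k - 1] >= 0 && f[k - 1] < FLT_RADIX);
        weight /= FLT_RADIX;          /* weight == b^(-k) */
        sum += f[k - 1] * weight;
    }
    /* scalbn multiplies by FLT_RADIX^e without going through pow. */
    return s * scalbn(sum, e);
}
```

For example, model_digits(-0.15625, &s, &e, f) should yield
s = -1, e = -2 and the digits 1, 0, 1 followed by zeros,
because 0.15625 = 0.101 (binary) * 2^(-2). Feeding those
values back into model_value reproduces the original number.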
This symmetry of floating-point values with respect to
the sign must also *not* be disturbed by subnormals,
infinities and/or NaNs, which follows from the further
constraints described in 5.2.4.2.2/3.
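
The symmetry can also be observed on a concrete
implementation. The following C sketch is only an empirical
check on the host platform, not a proof of anything the
standard says; the function name negation_is_symmetric is
invented for this example.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Hypothetical helper: returns 1 if negating x changes nothing but the
   sign, 0 otherwise. */
int negation_is_symmetric(double x)
{
    if (isnan(x))
        return isnan(-x) ? 1 : 0;           /* a NaN stays a NaN */

    if (isinf(x))                           /* +inf <-> -inf */
        return isinf(-x) && (!signbit(-x) != !signbit(x));

    /* Finite values (normal, subnormal or zero): the fraction returned
       by frexp flips its sign, the exponent stays the same. */
    int e_pos, e_neg;
    double m_pos = frexp(x, &e_pos);
    double m_neg = frexp(-x, &e_neg);
    return m_neg == -m_pos && e_neg == e_pos && -(-x) == x;
}
```

On a typical IEEE 754 platform one would expect this to
return 1 for values such as 1.0, DBL_MAX, DBL_MIN, a
subnormal like DBL_MIN / 4, INFINITY and NAN.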
