The answer to the OP's question can be found in the C standard (to which the C++ standard delegates) and is based on the following *symmetric* floating-point model, described in (C99) 5.2.4.2.2:

x = s b^e Sum[f_k b^(-k), {k, 1, p}]

where

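x: Any floating-point number

s: Sign (+1 or -1)

b: Base (radix) of the exponent representation (an integer > 1)

e: Exponent (an integer), where e_min <= e <= e_max

p: Precision (the number of base-b digits in the significand)

f_k: Nonnegative integers less than b (the significand digits)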

The formula uses Mathematica notation, as described in http://documents.wolfram.com/mathematica/functions/Sum, and _ and ^ denote a subscript and a superscript, respectively.
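To make the model concrete, here is a minimal sketch in C that evaluates the formula directly. The function `model_value` and its parameter names are hypothetical, chosen only to mirror the symbols s, b, e, p and f_k; nothing like it appears in the standard.

```c
#include <stddef.h>

/* Evaluate x = s * b^e * Sum[f_k * b^(-k), {k, 1, p}].
   model_value and its parameter names are hypothetical, chosen to mirror
   the symbols in the formula; f[0] holds f_1, f[1] holds f_2, and so on. */
double model_value(int s, int b, int e, const int *f, size_t p)
{
    double sum = 0.0;
    double weight = 1.0;
    for (size_t k = 0; k < p; ++k) {
        weight /= b;              /* weight is now b^-(k+1) */
        sum += f[k] * weight;     /* add the digit f_(k+1) at that weight */
    }

    /* Compute b^e by repeated multiplication or division, so the sketch
       does not depend on the math library. */
    double scale = 1.0;
    for (int i = 0; i < e; ++i) scale *= b;
    for (int i = 0; i > e; --i) scale /= b;

    return s * scale * sum;
}
```

For example, with b = 2, e = 1 and digits f_1 = f_2 = 1, `model_value(+1, 2, 1, f, 2)` gives 1.5 and `model_value(-1, 2, 1, f, 2)` gives -1.5. Only s differs between the two calls, which is the symmetry the model expresses.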

This obvious symmetry of floating-point values with respect to the sign must also *not* be disturbed by subnormals, infinities, or NaNs, which follows from the further constraints described in 5.2.4.2.2/3.
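One way to observe the sign symmetry on a particular platform is to compare the stored bits of x and -x. The sketch below is an illustration under an assumption the C standard does not make: that `double` is a 64-bit IEEE 754 binary64 value whose top bit is the sign. The helper name `negation_diff_bits` is hypothetical.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Return the bits in which x and -x differ.
   ASSUMPTION: double is 64 bits wide (IEEE 754 binary64); this is common
   but not required by the C standard. negation_diff_bits is a
   hypothetical helper name. */
uint64_t negation_diff_bits(double x)
{
    double y = -x;
    uint64_t bx, by;
    memcpy(&bx, &x, sizeof bx);   /* copy the object representation of x */
    memcpy(&by, &y, sizeof by);   /* copy the object representation of -x */
    return bx ^ by;               /* set bits mark where the two differ */
}
```

Under that assumption, calling it with `1.5`, `DBL_MAX`, or the subnormal `DBL_MIN / 2` should return only the sign bit, `0x8000000000000000`, in each case. In particular, the most negative finite `double` is `-DBL_MAX`.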

## Saturday, October 8, 2011

### How to represent floating number in binary mode?
