Floating Point Numbers

References

Optional texts:

"What Every Computer Scientist Should Know About Floating-Point Arithmetic" , David Goldberg, 1991
"Anatomy of a floating point number" , John D Cook, 2009
"Accuracy and Stability of Numerical Algorithms" , Hingham, 2002

Outline

Positional Numeral System
Floating point format
Errors
Operations, fundametal axiom

Positional Numeral System

Ingredients
1. A base $\beta\in\br{2,3,4...}$
2. A sequence of digits, $d_0,d_1,...,\in\br{0,1,...,\beta-1}$
3. Exponent $e\in\Z$

representation: $d_0 d_1 d_2 ... \times \beta^e$

Homework

Can all nonnegative real numbers be represented in such a manner for an arbitrary base $\beta\in\br{2,3,...}$ ?
Suppose $e=-1$, what are the range of numbers that can be represented for an arbitrary base $\beta$ ?
Characterize the numbers that have a unique representation in a base $\beta$ (iff condition+proof)

Floating Point Format

Definition. A floating point is one that can be represented in a base $\beta$ with a fixed digits $p$ (precision), and with exponent $e\in\br{e_{\min},e_{\max}}$

Example. $\beta=10, p=3, e_{\min}=-1, e_{\max}=1$. Then, $0.1$ can be represented as

d0  d1  d2  e
0   0   1   1
0   1   0   0
1   0   0   -1

Definition. A normalized floating point number has $d_0\neq0$

Example. The total number of values that can be prepresented in normalized floating point format with $\beta,p,e_{\min},e_{\max}$?

Answer. $\beta^{p-1}(\beta-1)(e_{\max}-e_{\min}+1)$ ?

Homework

Take decimal number, base, precision, return closest approximation of normalized floating point with vector of digits, exponent
given base, precision, $e_{\min},e_{\max}$, list all normalized floating point representable

Floating Point Approximations

A IEEEE 16-bit (Half-precision standard) has
- 1 bit for sign
- 5 bits exponent 00000 and 11111 reserved for 0 and $\infty$, 30 exp
- 11 significant (only 10 stored since leading bit must be 1 since normalized)

Question. What are smallest and largest positive numbers that can be represented?

Answer.
- Exponents: $2^5 = 32$ exponents. Minus $0$ and $\infty=30$ possible exponents, $-14,..,15$.
- Smallest non-normalized: $(0+0(2^{-1})...0(2^{-9})+1(2^{-10}))\times2^{-14}=2^{-24}\approx5.96\times10^{-8}$
- Smallest normalized: $(1+0(2^{-1})...0(2^{-9})+0(2^{-10}))*2^{-14}=2^{-14}\approx6.1\times10^{-5}$
- Largest finite: $(1+1(2^{-1})...1(2^{-9})+1(2^{-10}))\times2^{15}=65504$

Note. Julia has implementation allowing non-normalized numbers, and somehow allows 11-bits of digits.

A IEEEE 64-bit (Double-precision standard) has
- 1 bit for sign
- 11 bits exponent 0...0 and 1...1 reserved for 0 and $\infty$
- 11 significant (only 10 stored since leading bit must be 1 since normalized)

Homework

(Look up 64-bit standard) What is smallest normalized and non-normalized positive value? Largest noninfinite value?
Is there a general formula for determining the largest positive value for a given base $\beta$, precision $p$, and $e_{\max}$?
Verify the smallest non-normalized positive number that can be represented? Longest noninfinite?

Errors

Units in the Last Place (ULP) eps
Absolute+Relative Error

$\fl:\R_{20}\to S$
- Absolute Error: $|\fl(z)-z|$
- Relative Error: $z^{-1}|\fl(z)-z|$

Lemma. If $z$ has exponent $e$, the max absolute error is $\tfrac12\beta^{e-p+1}$
Proof. : homework

Lemma. If $z$ has exponent $e$, the relative error is bounded above by $\tfrac12\beta^{1-p}$

Proof. if $z$ has exponent $e$, $\beta^e\in\Z$,
$$\tfrac12|\fl(z)-z|\leq\frac{\beta^{e-p+1}}{2\beta^e}=\tfrac12\beta^{1-p}~~~\leftarrow\text{(eps)}$$

$\varepsilon=\tfrac12\beta^{1-p}$ called machine epsilon

Homework

What happens if we consider negative numbers?

Operations

Fundamental Axiom

For any $\operatorname{op}$ of the 4 arithmetic operations $(+-\times\div)$ we have the following bound
$$\fl(x \op y)=(x\op y)(1+\delta)$$

where $|\delta|\leq u$ where $u$ is commonly $2\varepsilon$

Matrix Storage $A\in\R^{m\times n}$
$|\fl(A)-A|\leq u|A|$

Homework question:
1. Show that $\|A\|$ is equal to maximum of the $\ell^1$ norms of the columns of $A$
2. Show that $\|A\|_\infty$ is equal to maximum of the $\ell^1$ norms of the rows of $A$
3. $\|\fl(A)-A\|_p\leq u\|A\|_p$

Dot Product

Let $X,y\in\R^n$, $\fl(x'y)-x'y$

$x'y=\sum x_iy_i$ pseudo-code:

function flDot(x,y)
    length(x) == length(y) || error("wrong dim")
    s = 0
    for i in 1..n
        s += x[i]*y[i]
    end
    return s

Lemma Let $x,y\in\R^n$. Assuming the fundamental axiom holds and $nu<0.01$ then
$$|\fl(x'y)-x'y|\leq 1.01 nu|x|'|y|$$

Proof Let $\delta_p$ denots $s$ after $p$ iterations $s_1=\fl(x_1y_1)=(x_1y_1)(1+\delta_1)$, $|\delta_1|\leq u$
$s_p=\fl(s_{p-1}+\fl(x_p+y_p))=(s_{p-1}+\fl(x_p+y+p))(1+\epsilon_p)$, $|\epsilon_p|\leq u$

$s_p=(s_{p-1}+x_py_p(1+\delta_p))(1+\epsilon_p)$, $|s_p\|,|\epsilon_p|\leq u$

Let $\epsilon_1=0$
$s_1=(x_1y_1)(1+\delta_1)$
$s_2=(s_1+x_2y_2(1+\delta_2))(1+\epsilon_2)=x_1y_1(1+\delta_1)(1+\epsilon_2)+x_2y_2(1+\delta_2)(1+\epsilon_2)$
$s_3=(s_2+x_2y_2(1+\delta_3))(1+\epsilon_3)$
$=x_1y_1(1+\delta_1)(1+\epsilon_1)(1+\epsilon_2)(1+\epsilon_3)+x_2y_2(1+\delta_2)(1+\epsilon_2)(1+\epsilon_3)+x_3y_3(1+\delta_3)(1+\epsilon_3)$
$s_p=\sum_{i=1}^px_iy_i(1+\delta_i)\prod_{j=1}^p(1+\epsilon_j)$
$|s_n-x'y|=|\sum_{i=1}^n(x_iy_i)\br{1-(1+\delta_i)\prod_{j=1}^n(1+\epsilon_j}|\leq\sum_{i=1}^n|x_iy_i\|\gamma_y|$

Lemma. If $|\delta_i|\leq u$ for $i=1,..,n$ s.t. $nu<2$ let $1+\eta=\prod_{i=1}^n(1+\delta_i)$, then $$|\eta|\leq\frac{nu}{1-nu/2}$$

Proof. HOMEWORK $|\eta|=|\prod_{i=1}^n(1+\delta_i)-1|\leq(1+u)^n-1$
hint : show $1+u\leq\exp(u)$, let $g(u)=\exp(u)-1-u$, look at $0$ and derivative.

Then, using this lemma,
$|\eta|\leq\exp(nu)-1$
$\leq nu+\frac{(nu)^2}{2!}+\frac{(nu)^3}{3!}+..$
Then, $|\eta|\leq nu\left(\frac1{1-nu/2}\right)$

Then, previous inequality $\sum_{i=1}^n|x_iy_i\|\gamma_y|\leq\frac{nu}{1-nu/2}\sum_{i=1}^n|x_iy_i|\leq\frac{nu}{0.995}\sum_{i=1}^n|x_iy_i|$