Optional texts:
Ingredients
1. A base $\beta\in\br{2,3,4...}$
2. A sequence of digits, $d_0,d_1,...,\in\br{0,1,...,\beta-1}$
3. Exponent $e\in\Z$
representation: $d_0 d_1 d_2 ... \times \beta^e$
Definition. A floating point is one that can be represented in a base $\beta$ with a fixed digits $p$ (precision), and with exponent $e\in\br{e_{\min},e_{\max}}$
Example. $\beta=10, p=3, e_{\min}=-1, e_{\max}=1$. Then, $0.1$ can be represented as
d0 d1 d2 e
0 0 1 1
0 1 0 0
1 0 0 -1
Definition. A normalized floating point number has $d_0\neq0$
Example. The total number of values that can be prepresented in normalized floating point format with $\beta,p,e_{\min},e_{\max}$?
Answer. $\beta^{p-1}(\beta-1)(e_{\max}-e_{\min}+1)$ ?
A IEEEE 16-bit (Half-precision standard) has
- 1 bit for sign
- 5 bits exponent 00000 and 11111 reserved for 0 and $\infty$, 30 exp
- 11 significant (only 10 stored since leading bit must be 1 since normalized)
Question. What are smallest and largest positive numbers that can be represented?
Answer.
- Exponents: $2^5 = 32$ exponents. Minus $0$ and $\infty=30$ possible exponents, $-14,..,15$.
- Smallest non-normalized: $(0+0(2^{-1})...0(2^{-9})+1(2^{-10}))\times2^{-14}=2^{-24}\approx5.96\times10^{-8}$
- Smallest normalized: $(1+0(2^{-1})...0(2^{-9})+0(2^{-10}))*2^{-14}=2^{-14}\approx6.1\times10^{-5}$
- Largest finite: $(1+1(2^{-1})...1(2^{-9})+1(2^{-10}))\times2^{15}=65504$
Note. Julia has implementation allowing non-normalized numbers, and somehow allows 11-bits of digits.
A IEEEE 64-bit (Double-precision standard) has
- 1 bit for sign
- 11 bits exponent 0...0 and 1...1 reserved for 0 and $\infty$
- 11 significant (only 10 stored since leading bit must be 1 since normalized)
$\fl:\R_{20}\to S$
- Absolute Error: $|\fl(z)-z|$
- Relative Error: $z^{-1}|\fl(z)-z|$
Lemma.
If $z$ has exponent $e$, the max absolute error is $\tfrac12\beta^{e-p+1}$
Proof.
:
homework
Lemma. If $z$ has exponent $e$, the relative error is bounded above by $\tfrac12\beta^{1-p}$
Proof.
if $z$ has exponent $e$, $\beta^e\in\Z$,
$$\tfrac12|\fl(z)-z|\leq\frac{\beta^{e-p+1}}{2\beta^e}=\tfrac12\beta^{1-p}~~~\leftarrow\text{(eps)}$$
$\varepsilon=\tfrac12\beta^{1-p}$ called machine epsilon
What happens if we consider negative numbers?
For any $\operatorname{op}$ of the 4 arithmetic operations $(+-\times\div)$ we have the following bound
$$\fl(x \op y)=(x\op y)(1+\delta)$$
where $|\delta|\leq u$ where $u$ is commonly $2\varepsilon$
Matrix Storage $A\in\R^{m\times n}$
$|\fl(A)-A|\leq u|A|$
Homework question:
1. Show that $\|A\|$ is equal to maximum of the $\ell^1$ norms of the columns of $A$
2. Show that $\|A\|_\infty$ is equal to maximum of the $\ell^1$ norms of the rows of $A$
3. $\|\fl(A)-A\|_p\leq u\|A\|_p$
Let $X,y\in\R^n$, $\fl(x'y)-x'y$
$x'y=\sum x_iy_i$ pseudo-code:
function flDot(x,y)
length(x) == length(y) || error("wrong dim")
s = 0
for i in 1..n
s += x[i]*y[i]
end
return s
Lemma
Let $x,y\in\R^n$. Assuming the fundamental axiom holds and $nu<0.01$ then
$$|\fl(x'y)-x'y|\leq 1.01 nu|x|'|y|$$
Proof
Let $\delta_p$ denots $s$ after $p$ iterations $s_1=\fl(x_1y_1)=(x_1y_1)(1+\delta_1)$, $|\delta_1|\leq u$
$s_p=\fl(s_{p-1}+\fl(x_p+y_p))=(s_{p-1}+\fl(x_p+y+p))(1+\epsilon_p)$, $|\epsilon_p|\leq u$
$s_p=(s_{p-1}+x_py_p(1+\delta_p))(1+\epsilon_p)$, $|s_p\|,|\epsilon_p|\leq u$
Let $\epsilon_1=0$
$s_1=(x_1y_1)(1+\delta_1)$
$s_2=(s_1+x_2y_2(1+\delta_2))(1+\epsilon_2)=x_1y_1(1+\delta_1)(1+\epsilon_2)+x_2y_2(1+\delta_2)(1+\epsilon_2)$
$s_3=(s_2+x_2y_2(1+\delta_3))(1+\epsilon_3)$
$=x_1y_1(1+\delta_1)(1+\epsilon_1)(1+\epsilon_2)(1+\epsilon_3)+x_2y_2(1+\delta_2)(1+\epsilon_2)(1+\epsilon_3)+x_3y_3(1+\delta_3)(1+\epsilon_3)$
$s_p=\sum_{i=1}^px_iy_i(1+\delta_i)\prod_{j=1}^p(1+\epsilon_j)$
$|s_n-x'y|=|\sum_{i=1}^n(x_iy_i)\br{1-(1+\delta_i)\prod_{j=1}^n(1+\epsilon_j}|\leq\sum_{i=1}^n|x_iy_i\|\gamma_y|$
Lemma. If $|\delta_i|\leq u$ for $i=1,..,n$ s.t. $nu<2$ let $1+\eta=\prod_{i=1}^n(1+\delta_i)$, then $$|\eta|\leq\frac{nu}{1-nu/2}$$
Proof.
HOMEWORK
$|\eta|=|\prod_{i=1}^n(1+\delta_i)-1|\leq(1+u)^n-1$
hint
: show $1+u\leq\exp(u)$, let $g(u)=\exp(u)-1-u$, look at $0$ and derivative.
Then, using this lemma,
$|\eta|\leq\exp(nu)-1$
$\leq nu+\frac{(nu)^2}{2!}+\frac{(nu)^3}{3!}+..$
Then, $|\eta|\leq nu\left(\frac1{1-nu/2}\right)$
Then, previous inequality $\sum_{i=1}^n|x_iy_i\|\gamma_y|\leq\frac{nu}{1-nu/2}\sum_{i=1}^n|x_iy_i|\leq\frac{nu}{0.995}\sum_{i=1}^n|x_iy_i|$