May 28, 2016

It is well known in all programming languages the floating point math loses some accuracy. Therefore we have something like the following statement

1

0.1 + 0.2 != 0.3

This is explained in detail in here, but basically, is because some fractional numbers aren’t well represented on binary, and the carryover loses accuracy, but it keeps precise.

There are several ways to mitigate this effect, the most known use double numbers or big decimal representations, the problem with those is the operations tend to use more computing resources, especially with BigNumbers, that usually are objects with large strings and complicated math behind.

Let’s assume we are representing money if we choose to store float values in our database, in that case, our business will lose money. If we use a double precision data type we gain accuracy, with a little price of resources but is not enough, our company cannot lose any cent. Using BigData it will solve the problem, but to perform calculations, more computing resources are required. The best solution for this case is by using cents and represent everything on integers(computers work better with integers), and we will never have half cents in any currency.

In statistics or graphics, having a 100% accuracy and precision do not make any difference. Because from the human perspective is the same to have 40.54 to 40.5349 and in graphics, those small differences aren´t perceptible. As long as we keep our metrics close to the reality and humanly readable, we can use float values, without hesitating.

As in an article recent published here, we can see how the NASA it only uses 14 decimals of Pi to send a man to the moon. That is very impressive, although some developers think that, having more than 14 decimals are not enough for the calculations, especially if our application gather information from sensors. But we still need high accuracy and precision, the question is, how much?

I generate an SQLite database with around 1.5M data points and copy the values in four different data types:

- Integer: using the four first decimals
- Integer: using the eight first decimals
- Real: using the float value
- Text: as it is from the source to be parsed within a BigDecimal

The simplest operation from a computer is the addition, so for each data type, it will be added the entire data points and compared with the big decimal value, to estimate the error. For the integer values, after the addition, a division it’s needed to restore the float type and get the closest approximation, to the real value.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45

require 'sqlite3'
require 'bigdecimal'
require 'benchmark/ips'
DB = SQLite3::Database.new 'data.db'
def sum_int
rows = DB.execute("SELECT value_int FROM points")
rows.flatten.map(&:to_i).inject(:+).to_i / 10_000.0
end
def sum_big_int
rows = DB.execute("SELECT value_big_int FROM points")
rows.flatten.map(&:to_i).inject(:+).to_i / 100_000_000.0
end
def sum_real
rows = DB.execute("SELECT value_real FROM points")
rows.flatten.map(&:to_f).inject(:+)
end
def sum_txt
rows = DB.execute("SELECT value_txt FROM points")
rows.flatten.map{ |x| BigDecimal.new(x) }.inject(:+)
end
real_value = sum_txt.to_f
puts "SUM error = #{(real_value - real_value) / real_value}, text = #{sum_txt.to_f}"
puts "SUM error = #{(real_value - sum_real) / real_value}, real = #{sum_real}"
puts "SUM error = #{(real_value - sum_big_int) / real_value}, bint = #{sum_big_int}"
puts "SUM error = #{(real_value - sum_int) / real_value}, int = #{sum_int}"
Benchmark.ips do |x|
x.config(time: 10, warmup: 2)
x.report("sql value_int") { sum_int }
x.report("sql value_big_int") { sum_big_int }
x.report("sql value_real") { sum_real }
x.report("sql value_txt") { sum_txt }
x.compare!
end

Using integers can improve our calculations but not for much, and the lost precision can be compensated if our values aren’t critical, otherwise using float type it will be just enough. We still can use BigDecimal values, but the cost in the long term run could affect our performance, I don’t recommend use this data type if is not related to something critical in our application. Although we still calculate parabolic trajectories with float data type.