Using the right data type for a calculation sounds like obvious advice.
There are many blog posts about using temporal data types for temporal data instead of strings. An obvious reason is data integrity and correctness: we don't gain much by storing dates as 2019-09-10 in one record and as Nov 10, 2019 in the next one. Performance is also important in this case, for example because the optimiser's statistics are better (an often overlooked side effect of unfit data types).
But at a customer site, I’ve recently discovered a surprising (not surprising in hindsight) performance issue when using NUMBER instead of BINARY_DOUBLE in an Oracle database.
NUMBER for monetary amounts
The NUMBER type (or DECIMAL / NUMERIC in other RDBMS) is perfectly suited for all decimal numbers that require exact precision and rounding. I mean, if you ever encounter a floating point style amount like 1.0e+1 in an invoice, such as when I purchase video games on Steam, your trust level for the platform immediately goes down by a factor of 1.0e+1 (even if, technically, it does not matter in this case). So the default, e.g. in banking systems, is always to use NUMBER or an equivalent type for monetary amounts, e.g. java.math.BigDecimal in Java.
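As a quick aside, here is a minimal illustration (my own sketch, not taken from the invoice above) of why binary floating point types are a poor fit for exact monetary amounts: 0.1 has no exact IEEE 754 representation, whereas NUMBER stores it exactly. The exact display depends on your SQL client:

SELECT
  CAST(0.1 AS BINARY_DOUBLE) * 3 AS binary_result, -- approximately 0.30000000000000004
  CAST(0.1 AS NUMBER(20, 10)) * 3 AS number_result -- exactly 0.3
FROM dual;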
Calculations on numbers
But now, let's assume we want to do some statistics on these amounts. Some aggregations of various numeric values, maybe even of their logarithms. We can establish that these two expressions are equal:

ln(a * b) = ln(a) + ln(b)

Or in other words, for positive numbers:

a * b = exp(ln(a) + ln(b))

We've already blogged about how this approach can be used to emulate a PRODUCT() aggregate function in SQL, which is very useful for some cases, but none of the SQL databases jOOQ supports has built-in support for this yet. Notice that the blog post also takes care of zero and negative numbers.
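For strictly positive numbers, the emulation simply wraps the sum of logarithms back into EXP(). Here is a minimal sketch of the idea (my own toy example, ignoring the zero and negative number handling from that post):

-- PRODUCT() emulation for strictly positive values: 2 * 3 * 5 = 30
SELECT EXP(SUM(LN(x))) AS product
FROM (
  SELECT 2 AS x FROM dual UNION ALL
  SELECT 3 FROM dual UNION ALL
  SELECT 5 FROM dual
);
-- Yields (approximately) 30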
But which number type to choose?
Now, we might be tempted to just calculate the LN(SOME_NUMBER) value, and sum that up using SUM(LN(SOME_NUMBER)) for this use case. This turns out to be terribly slow in Oracle. We first suspected bad indexes, and even doubted aggregation in general, until I suggested we try using BINARY_DOUBLE instead. In our case, we didn't care about the exact numeric value of the amount: an IEEE 754 floating point number with double precision was going to be good enough.
The results were surprising. In a simple benchmark, we compared 10 approaches to calculating this sum of logarithms:
- Using NUMBER(20, 10) and SUM(LN(the_number))
- Using NUMBER(20, 10) and SUM(LN(CAST(the_number AS BINARY_DOUBLE)))
- Using NUMBER(20, 10) and SUM(LN(TO_BINARY_DOUBLE(the_number)))
- Using NUMBER(20, 10), containing a pre-calculated LN value
- Using DOUBLE PRECISION and SUM(LN(the_double))
- Using DOUBLE PRECISION and SUM(LN(CAST(the_double AS BINARY_DOUBLE)))
- Using DOUBLE PRECISION and SUM(LN(TO_BINARY_DOUBLE(the_double)))
- Using DOUBLE PRECISION, containing a pre-calculated LN value
- Using BINARY_DOUBLE and SUM(LN(the_binary))
- Using BINARY_DOUBLE, containing a pre-calculated LN value
In other words:

- We tried the above 3 possible numeric data types, expecting BINARY_DOUBLE to be the fastest.
- We tried to pre-calculate the LN() value for the benchmark, to see how much effort goes into summing, and how much effort goes into the LN() calculation with each type. While in general, in this system, such pre-calculation is impractical, we still wanted to have a benchmark comparison, in case a materialised view or other technique would be feasible.
- We tried casting and converting each type to BINARY_DOUBLE prior to passing the value to the LN() function. Instead of migrating all the data (with possible side effects), we wanted to see if we could solve this to a reasonable extent "on the fly".

The benchmark ran against the following table and data:
CREATE TABLE data (
n1 NUMBER(20, 10),
n2 NUMBER(20, 10),
d1 DOUBLE PRECISION,
d2 DOUBLE PRECISION,
b1 BINARY_DOUBLE,
b2 BINARY_DOUBLE
);
INSERT INTO data
SELECT level, ln(level), level, ln(level), level, ln(level)
FROM dual
CONNECT BY level <= 100000;
So, we have 100000 records, whose SUM(LN(x)) we want to calculate in the above 10 different ways. N1 contains the raw numeric value, and N2 contains the pre-calculated LN(N1); the same holds for D1/D2 (DOUBLE PRECISION) and B1/B2 (BINARY_DOUBLE).
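For reference, the first four of the ten statements look like this (taken from the benchmark script at the end of this post); statements 5-8 repeat the pattern for D1/D2, and statements 9-10 for B1/B2:

SELECT SUM(LN(n1)) FROM data;                        -- Statement 1: LN() on NUMBER
SELECT SUM(LN(CAST(n1 AS BINARY_DOUBLE))) FROM data; -- Statement 2: cast, then LN()
SELECT SUM(LN(TO_BINARY_DOUBLE(n1))) FROM data;      -- Statement 3: convert, then LN()
SELECT SUM(n2) FROM data;                            -- Statement 4: pre-calculated LN()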
The benchmark technique is described here. Do note that it has a lot of caveats and is only useful for a limited number of verifications. Please always be careful when running such benchmarks: they often do not test production-like situations. And never use such benchmarks to compare different RDBMS products directly; they are only useful to compare two approaches on the same RDBMS product.

The results as run on Oracle 18c XE in Docker on development hardware are below. Times are compared relative to the fastest run, as the actual time spent in each execution is not interesting for the comparison. We did have similar results on production-like hardware, though, and I'm sure you can produce similar results in other RDBMS:
NUMBER(20, 10)
-------------------------------------------------
Run 3, Statement 1 : 280.34143 (avg : 280.75347)
Run 3, Statement 2 : 7.99402   (avg : 8.03506)
Run 3, Statement 3 : 7.71383   (avg : 7.73017)
Run 3, Statement 4 : 1.05456   (avg : 1.11735)

DOUBLE PRECISION
-------------------------------------------------
Run 3, Statement 5 : 278.89476 (avg : 279.72981)
Run 3, Statement 6 : 8.06512   (avg : 8.07033)
Run 3, Statement 7 : 7.81873   (avg : 7.80063)
Run 3, Statement 8 : 1.5315    (avg : 1.54347)

BINARY_DOUBLE
-------------------------------------------------
Run 3, Statement 9 : 2.4963    (avg : 2.57184)
Run 3, Statement 10: 1         (avg : 1.02943)

How to read these results?
- Statement 10 is the fastest one, unsurprisingly, as it aggregates pre-calculated LN(binary_double) values. The pre-calculation means that all the work has been done already before the report, and the data type is the one we expected to perform best in general.
- Statements 4 and 8 are almost as fast (pre-calculated LN() values). Being only slight factors off, we can attribute the difference to the usual benchmark flaws, although it's interesting to see that DOUBLE PRECISION seems 1.5x slower to sum than BINARY_DOUBLE and even NUMBER.
- Statements 1, 5, 9 are the ones where no data type conversion is applied and SUM(LN(the_value)) is being calculated. It is staggering how much slower both NUMBER and DOUBLE PRECISION are than BINARY_DOUBLE: statements 1 and 5 are a factor of 112x slower than statement 9!
- Statements 2-3 and 6-7 prove that converting the NUMBER or DOUBLE PRECISION values to BINARY_DOUBLE on the fly provides a sufficiently performant workaround, making those statements only about 3x slower than statement 9.
- Statements 2-3 and 6-7 also show that casting and converting are about equivalent.
Analysis
The order of magnitude may seem surprising at first, but thinking about it, it is not. We would never do CPU-intensive computation with java.math.BigDecimal in Java. The BigDecimal type is there for numeric accuracy, e.g. when it really matters what the monetary amount is, exactly. When we run analytics on monetary amounts, using double is sufficient in Java as well.

If our data is BigDecimal, and we cannot reasonably change that, it might still be better to use the BigDecimal::doubleValue conversion prior to further processing with e.g. Math::log. This translates directly to SQL, whose LN() implementations are data type specific: IEEE 754 floating point numbers have been designed precisely for this kind of computation.
When doing CPU-intensive computations, whether in Java or in the database, we should always evaluate our various options of:

- Quick-fixing our data sets for the report (ad-hoc conversion prior to calculation), as sketched below
- Thoroughly fixing our data sets in the schema (migration towards a better data type)
- Preprocessing our data sets (pre-calculating some very commonly used computations), also sketched below
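For example, options 1 and 3 could look like the following sketch in SQL (the amounts table and its columns are hypothetical, purely for illustration):

-- Option 1: quick fix, convert to BINARY_DOUBLE on the fly, just for the report
SELECT SUM(LN(TO_BINARY_DOUBLE(amount))) FROM amounts;

-- Option 3: preprocessing, store the pre-calculated logarithm in a new column
-- (which then has to be kept in sync on every write)
ALTER TABLE amounts ADD ln_amount BINARY_DOUBLE;
UPDATE amounts SET ln_amount = LN(TO_BINARY_DOUBLE(amount));

-- The report then only needs to sum the pre-calculated values
SELECT SUM(ln_amount) FROM amounts;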
Benchmark logic
Just run the below on your own hardware. I’m curious to see your results:
-- Copyright Data Geekery GmbH
--
-- Licensed under the Apache License, Version 2.0 (the "License");
-- you may not use this file except in compliance with the License.
-- You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing, software
-- distributed under the License is distributed on an "AS IS" BASIS,
-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-- See the License for the specific language governing permissions and
-- limitations under the License.
--
-- This version displays relative execution times (fastest execution = 1)
SET SERVEROUTPUT ON
ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;
CREATE TABLE data (
n1 NUMBER(20, 10),
n2 NUMBER(20, 10),
d1 DOUBLE PRECISION,
d2 DOUBLE PRECISION,
b1 BINARY_DOUBLE,
b2 BINARY_DOUBLE
);
INSERT INTO data
SELECT level, ln(level), level, ln(level), level, ln(level)
FROM dual
CONNECT BY level <= 100000;
CREATE TABLE results (
run NUMBER(2),
stmt NUMBER(2),
elapsed NUMBER
);
DECLARE
v_ts TIMESTAMP WITH TIME ZONE;
v_repeat CONSTANT NUMBER := 10;
v_stmt NUMBER;
BEGIN
-- Repeat the whole benchmark several times to avoid warmup penalty
FOR r IN 1..5 LOOP
v_ts := SYSTIMESTAMP;
v_stmt := 1;
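-- Statement 1: SUM(LN(n1)) on the raw NUMBER column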
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT sum(ln(n1)) FROM data
) LOOP
NULL;
END LOOP;
END LOOP;
INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
v_ts := SYSTIMESTAMP;
v_stmt := v_stmt + 1;
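-- Statement 2: NUMBER cast to BINARY_DOUBLE before LN()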
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT sum(ln(cast(n1 as binary_double))) FROM data
) LOOP
NULL;
END LOOP;
END LOOP;
INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
v_ts := SYSTIMESTAMP;
v_stmt := v_stmt + 1;
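-- Statement 3: NUMBER converted with TO_BINARY_DOUBLE() before LN()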
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT sum(ln(to_binary_double(n1))) FROM data
) LOOP
NULL;
END LOOP;
END LOOP;
INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
v_ts := SYSTIMESTAMP;
v_stmt := v_stmt + 1;
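-- Statement 4: pre-calculated LN() value stored in a NUMBER column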
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT sum(n2) FROM data
) LOOP
NULL;
END LOOP;
END LOOP;
INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
v_ts := SYSTIMESTAMP;
v_stmt := v_stmt + 1;
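-- Statement 5: SUM(LN(d1)) on the raw DOUBLE PRECISION column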
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT sum(ln(d1)) FROM data
) LOOP
NULL;
END LOOP;
END LOOP;
INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
v_ts := SYSTIMESTAMP;
v_stmt := v_stmt + 1;
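-- Statement 6: DOUBLE PRECISION cast to BINARY_DOUBLE before LN()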
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT sum(ln(cast(d1 as binary_double))) FROM data
) LOOP
NULL;
END LOOP;
END LOOP;
INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
v_ts := SYSTIMESTAMP;
v_stmt := v_stmt + 1;
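-- Statement 7: DOUBLE PRECISION converted with TO_BINARY_DOUBLE() before LN()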
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT sum(ln(to_binary_double(d1))) FROM data
) LOOP
NULL;
END LOOP;
END LOOP;
INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
v_ts := SYSTIMESTAMP;
v_stmt := v_stmt + 1;
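-- Statement 8: pre-calculated LN() value stored in a DOUBLE PRECISION column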
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT sum(d2) FROM data
) LOOP
NULL;
END LOOP;
END LOOP;
INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
v_ts := SYSTIMESTAMP;
v_stmt := v_stmt + 1;
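-- Statement 9: SUM(LN(b1)) on the raw BINARY_DOUBLE column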
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT sum(ln(b1)) FROM data
) LOOP
NULL;
END LOOP;
END LOOP;
INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
v_ts := SYSTIMESTAMP;
v_stmt := v_stmt + 1;
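-- Statement 10: pre-calculated LN() value stored in a BINARY_DOUBLE column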
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT sum(b2) FROM data
) LOOP
NULL;
END LOOP;
END LOOP;
INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
END LOOP;
FOR rec IN (
SELECT
run, stmt,
CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio,
CAST(AVG(elapsed) OVER (PARTITION BY stmt) / MIN(elapsed) OVER() AS NUMBER(10, 5)) avg_ratio
FROM results
ORDER BY run, stmt
)
LOOP
dbms_output.put_line('Run ' || rec.run ||
', Statement ' || rec.stmt ||
' : ' || rec.ratio || ' (avg : ' || rec.avg_ratio || ')');
END LOOP;
dbms_output.put_line('');
dbms_output.put_line('Copyright Data Geekery GmbH');
dbms_output.put_line('https://www.jooq.org/benchmark');
END;
/
DROP TABLE data;
DROP TABLE results;