dbl_to_lofi.Rd
Pack a double into the lower bits of an int32 with the given bits for sign, exponent and mantissa.
dbl_to_lofi(dbl, float_name = "bfloat16", float_bits = NULL)
lofi_to_dbl(lofi, float_name = "bfloat16", float_bits = NULL)
Argument | Description |
---|---|
dbl | 64-bit R double |
float_name | 'single', 'half', 'bfloat16'. Default: 'bfloat16' |
float_bits | Length (in number of bits) of sign, exponent and mantissa. Default: NULL. If this value is not NULL, it will override anything the user may have specified for |
lofi | low-bit representation |
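
For orientation, the three named formats have the standard bit layouts listed below. This is only a reference sketch: the name `float_layouts` is hypothetical, and it is an assumption (not confirmed by this page) that `float_bits` is passed as a three-element vector in the order sign, exponent, mantissa.

```r
# Standard bit widths for the three named formats.
# `float_layouts` is an illustrative name, not part of the package.
float_layouts <- list(
  single   = c(sign = 1L, exponent = 8L, mantissa = 23L),
  half     = c(sign = 1L, exponent = 5L, mantissa = 10L),
  bfloat16 = c(sign = 1L, exponent = 8L, mantissa = 7L)
)

# Total width of each format; every one fits within a 32-bit integer.
total_bits <- vapply(float_layouts, sum, integer(1))
print(total_bits)
```
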
32 bit integer with lower bits set to represent the quantized floating point value
Packing into a low-fidelity bit representation generally loses precision, i.e. converting back to a full 64-bit double will usually not return the number you started with. Only values that are exactly representable in the target format survive the round trip unchanged.
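
The precision loss can be illustrated in base R without the package. The sketch below is not the package's implementation: the helper names `dbl_to_bf16_sketch` and `bf16_to_dbl_sketch` are hypothetical, and it assumes bfloat16 is obtained by simply truncating the single-precision encoding to its top 16 bits (the package may round differently).

```r
# Hypothetical helper: keep the top 16 bits of the float32 encoding
# (1 sign bit, 8 exponent bits, 7 mantissa bits) and return them
# in the lower bits of an integer.
dbl_to_bf16_sketch <- function(x) {
  stopifnot(length(x) == 1, is.finite(x))
  # size = 4 writes the double as a single-precision float
  bytes <- writeBin(as.double(x), raw(), size = 4, endian = "big")
  as.integer(bytes[1]) * 256L + as.integer(bytes[2])
}

# Hypothetical helper: pad the 16 bits with zeros and read them back
# as a single-precision float, which R returns as a double.
bf16_to_dbl_sketch <- function(lofi) {
  bytes <- as.raw(c(lofi %/% 256L, lofi %% 256L, 0L, 0L))
  readBin(bytes, what = "numeric", size = 4, endian = "big")
}

packed <- dbl_to_bf16_sketch(pi)
bf16_to_dbl_sketch(packed)                   # 3.140625, not pi
bf16_to_dbl_sketch(dbl_to_bf16_sketch(1.5))  # 1.5, exactly representable
```

With only 7 mantissa bits, `pi` comes back as 3.140625, while 1.5 needs a single mantissa bit and is recovered exactly.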
Packing a double into low-fidelity format has no explicit support for special values such as NaN, NA or Inf. These values may get converted to other numeric values or other special values. The result is undefined. Operate on special values at your own risk.
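
One way to avoid undefined results is to reject special values before packing. The guard below is a minimal sketch in base R; `assert_finite` is a hypothetical helper and not part of the package.

```r
# Hypothetical guard: stop if any element is NA, NaN, Inf or -Inf,
# otherwise return the input unchanged.
assert_finite <- function(x) {
  bad <- !is.finite(x)
  if (any(bad)) {
    stop(sum(bad), " non-finite value(s) (NA, NaN or Inf) found; refusing to pack")
  }
  x
}

values <- assert_finite(c(1, -2.5, 1e10))  # passes through unchanged

# Intended use with the documented function (requires the package):
# packed <- dbl_to_lofi(assert_finite(values))
```

Failing early keeps special values from being silently turned into ordinary numbers. Callers who need to carry NA or Inf through would have to track them separately from the packed integers.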