Methodology
We extend and integrate QPyTorch in this project.
All formats are simulated with single-precision floating point.
Thus, the highest integer precision that can be realized is 24 bits, and the widest floating-point dynamic range is that of an 8-bit exponent.
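The 24-bit limit follows from the single-precision significand (23 stored bits plus one implicit bit). The snippet below is only an illustrative check using the standard library; the helper name `to_float32` is hypothetical and not part of this project.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (double precision) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Every integer up to 2**24 is exactly representable in float32 ...
below_limit = to_float32(2.0**24 - 1)
at_limit = to_float32(2.0**24)
# ... but 2**24 + 1 needs 25 significant bits, so it is rounded back to 2**24.
above_limit = to_float32(2.0**24 + 1)

print(below_limit == 2.0**24 - 1)  # True
print(at_limit == 2.0**24)         # True
print(above_limit == 2.0**24)      # True: the +1 is lost
```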
Format shorthands
For convenience, we use string shorthands to specify numerical formats.
For example, one can instantiate a format object by:
from numerical import Format
input_format = Format.from_shorthand("BFP[8|8]{64,-1}(SN)")
This is equivalent to:
from numerical import BlockFloatingPoint
input_format = BlockFloatingPoint(
    precision=8,
    block_size=64,
    block_dim=-1,
    symmetric=True,
    rounding="nearest",
)
Shorthand strings are composed of 4 parts:
IDENTIFIER[element_spec]{tensor_spec}(cast_behavior)
Only the first part, i.e. the identifier, is required; whether the remaining parts apply depends on the specific format.
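To make the four-part layout concrete, here is a minimal sketch that splits a shorthand string into its parts with a regular expression. It is not the parser behind `Format.from_shorthand`; the pattern and the function name `split_shorthand` are assumptions made for illustration.

```python
import re

# Hypothetical pattern for IDENTIFIER[element_spec]{tensor_spec}(cast_behavior).
# Only the identifier is mandatory; each bracketed part is optional.
_SHORTHAND = re.compile(
    r"^(?P<identifier>[A-Za-z]+)"
    r"(?:\[(?P<element_spec>[^\]]*)\])?"
    r"(?:\{(?P<tensor_spec>[^}]*)\})?"
    r"(?:\((?P<cast_behavior>[^)]*)\))?$"
)

def split_shorthand(shorthand: str) -> dict:
    """Split a shorthand into its four parts; absent parts map to None."""
    match = _SHORTHAND.match(shorthand.strip())
    if match is None:
        raise ValueError(f"not a valid shorthand: {shorthand!r}")
    return match.groupdict()

parts = split_shorthand("BFP[8|8]{64,-1}(SN)")
print(parts["identifier"])     # BFP
print(parts["element_spec"])   # 8|8
print(parts["tensor_spec"])    # 64,-1
print(parts["cast_behavior"])  # SN
```

Interpreting each part (for example, mapping `64,-1` to a block size and a block dimension) is format-specific and is left out of this sketch.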
Same
This is a dummy format; casting into it is a no-op.
The shorthand is:
SAME
Floating point
This is a floating point format, with each element having an optional sign bit s (1 for signed and 0 for unsigned), an m-bit mantissa, an e-bit exponent and an exponent bias b.
Two casting behaviors XY are supported:
X: flush subnormals, which is F for flushing, _ for not flushing.
Y: rounding mode, which is N for nearest (even when tied), S for stochastic rounding.
The shorthand is:
FP[s|e|m,b](XY)
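As a rough illustration of what a cast into such a format involves, the sketch below rounds one Python float to the nearest value with an e-bit exponent, m-bit mantissa and bias b. It is not the QPyTorch or project implementation. The function name `cast_fp` is hypothetical, and the sketch assumes a signed format, no exponent codes reserved for infinity or NaN, saturation on overflow, and flushing decided on the input magnitude before rounding.

```python
import math

def cast_fp(x: float, e: int, m: int, b: int, flush_subnormals: bool = False) -> float:
    """Round x to the nearest value with e exponent bits, m mantissa bits, bias b."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    min_exp = 1 - b              # exponent of the smallest normal number
    max_exp = (2**e - 1) - b     # assumption: no exponent code is reserved
    # math.frexp returns mag = f * 2**k with f in [0.5, 1), so floor(log2(mag)) = k - 1.
    exp = math.frexp(mag)[1] - 1
    if exp < min_exp:
        if flush_subnormals:
            return 0.0           # "F": inputs below the smallest normal become zero
        exp = min_exp            # "_": subnormals share the spacing of the smallest normal
    step = 2.0 ** (exp - m)      # distance between neighbouring representable values
    # Python's round() rounds ties to even, matching "nearest (even when tied)".
    rounded = round(mag / step) * step
    max_val = (2.0 - 2.0**-m) * 2.0**max_exp
    return sign * min(rounded, max_val)

# With e=5, m=2, b=15 the representable values in [1, 2) are 1.0, 1.25, 1.5, 1.75.
print(cast_fp(1.375, 5, 2, 15))  # 1.5 (tie between 1.25 and 1.5 goes to the even code)
print(cast_fp(1.125, 5, 2, 15))  # 1.0 (tie between 1.0 and 1.25 goes to the even code)
print(cast_fp(2.0**-16, 5, 2, 15, flush_subnormals=True))  # 0.0
```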
Block floating point
This is a block floating point format, with each element having an n-bit signed integer significand and an 8-bit shared exponent.
Blocks are groups of b contiguous elements along tensor dimension d.
One casting behavior is supported:
X: rounding mode, which is N for nearest, S for stochastic rounding.
The shorthand is:
BFP[n|8]{b,d}(X)
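The idea of a shared exponent per block can be sketched in a few lines of plain Python. This is an illustration only, not the project's kernel: `cast_bfp` is a hypothetical name, the input is a flat list rather than a tensor dimension d, the significand range is assumed symmetric, and the 8-bit limit on the shared exponent is not enforced.

```python
import math

def cast_bfp(values, n: int, block_size: int):
    """Quantize a flat list block by block with round-to-nearest."""
    qmax = 2 ** (n - 1) - 1  # assumed symmetric significand range [-qmax, qmax]
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        largest = max(abs(v) for v in block)
        if largest == 0.0:
            out.extend(0.0 for _ in block)
            continue
        # Shared exponent: smallest k such that largest < 2**k.
        shared_exp = math.frexp(largest)[1]
        # One significand step, so that largest / scale stays below 2**(n - 1).
        scale = 2.0 ** (shared_exp - (n - 1))
        for v in block:
            q = max(-qmax, min(qmax, round(v / scale)))  # n-bit signed integer
            out.append(q * scale)
    return out

# One block of four elements with 8-bit significands: the scale is 2**-6.
print(cast_bfp([1.0, 0.3, -0.01, 0.5], n=8, block_size=4))
# [1.0, 0.296875, -0.015625, 0.5]
```

Small elements lose precision when they share a block with a large one (here -0.01 becomes -0.015625), which is why the block size b matters.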
Fixed point
This is a fixed point format, with each element having an n-bit signed integer significand.
The position of the radix point is specified by a bias, i.e. a shift of ±b bits.
Three casting behaviors XYZ are supported, in this exact order:
X: clamping of out-of-range numbers, which is C for clamp, U for unclamp.
Y: symmetric/asymmetric quantization range, which is S for symmetric, A for asymmetric.
Z: rounding mode, which is N for nearest, S for stochastic rounding.
The shorthand is:
XP[n,±b](XYZ)
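The interplay of the clamping and symmetry flags can be illustrated with a scalar sketch. It is not the project's implementation: `cast_fixed` and its parameters are hypothetical, only nearest rounding is shown, and it assumes that a positive shift corresponds to that many fractional bits (the sign convention of b is not stated above).

```python
def cast_fixed(x: float, n: int, frac_bits: int,
               clamp: bool = True, symmetric: bool = True) -> float:
    """Quantize x to an n-bit signed integer scaled by 2**-frac_bits."""
    scale = 2.0 ** -frac_bits
    q = round(x / scale)  # nearest integer; Python's round() sends ties to even
    if clamp:
        q_max = 2 ** (n - 1) - 1
        # Symmetric: [-q_max, q_max]; asymmetric: full two's-complement range.
        q_min = -q_max if symmetric else -(2 ** (n - 1))
        q = max(q_min, min(q_max, q))
    return q * scale

print(cast_fixed(0.3, n=8, frac_bits=4))                       # 0.3125
print(cast_fixed(-100.0, n=8, frac_bits=4))                    # -7.9375 (clamped, symmetric)
print(cast_fixed(-100.0, n=8, frac_bits=4, symmetric=False))   # -8.0 (clamped, asymmetric)
print(cast_fixed(100.0, n=8, frac_bits=4, clamp=False))        # 100.0 (unclamped)
```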