Methodology ----------- We extend and integrate `QPyTorch`_ in this project. All formats are simulated with single-precision floating point. Thus, the highest integer precision that can be realized is ``24``-bit, and the highest floating point dynamic range ``8``-bit. .. _QPyTorch: https://github.com/Tiiiger/QPyTorch Format shorthands ----------------- Fopr convenience, we use string shorthands to specify numerical formats. For example, one can instantiate a format object by: .. code-block:: python from numerical import Format input_format = Format.from_shorthand("BFP[8|8]{64,-1}(SN)") This is equivalent to: .. code-block:: python from numerical import BlockFloatingPoint input_format = BlockFloatingPoint( precision=8, block_size=64, block_dim=-1, symmetric=True, rounding="nearest", ) Shorthand strings are composed of 4 parts: .. code-block:: IDENTIFIER[element_spec]{tensor_spec}(cast_behavior) Only the first part, i.e. the identifier, is required, the rest being conditional upon specific formats. Same ~~~~ This is a dummy format, cast into this format is a no-op. The shorthand is:: SAME Floating point ~~~~~~~~~~~~~~ This is a floating point format, with each element having an optional sign bit ``s`` (``1`` for signed and ``0`` for unsigned), a ``m``-bit mantissa, an ``e``-bit exponent and an exponent bias ``b``. Two casting behavior are supported: * ``X``: flush submornals, which is ``F`` for flushing, ``_`` for not flushing. * ``Y``: rounding mode, which is ``N`` for nearest (even when tied), ``S`` for stochastic rounding. The shorthand is:: FP[s|e|m,b](XY) Block floating point ~~~~~~~~~~~~~~~~~~~~ This is a block floating point format, with each element having a ``n``-bit signed integer significand and an ``8``-bit shared exponent. Blocks are groups of ``b`` contiguous elements along tensor dimension ``d``. One casting behavior is supported: * ``X``: rounding mode, which is ``N`` for nearest, ``S`` for stochastic rounding. The shorthand is:: BFP[n|8]{b,d}(X) Fixed point ~~~~~~~~~~~ This is a fixed point format, with each element having a ``n``-bit signed integer significand. Position of the radix point is specified by a bias of ``±b``-bit shift. Three casting behavior ``XYZ`` are supported, in exact order as follows: * ``X``: clamping of out-of-range numbers, which is ``C`` for clamp, ``U`` for unclamp. * ``Y``: symmetric/asymmetric quantization range, which is ``S`` for symmetric, ``A`` for asymmetric. * ``Z``: rounding mode, which is ``N`` for nearest, ``S`` for stochastic rounding. The shorthand is:: XP[n,±b](XYZ)