SambaNova Runtime release notes

Release 1.16 (2023-07-14)

New features and other improvements

  • Improved power management stability on SN30 systems.

  • Enabled RDU-level reset on SN30 systems and standardized the RDU reset log messages for success and failure.

  • Improvements to snconfig:

    • New command for displaying RDU driver operational state. For example: snconfig show Node op-state

    • Added MTU to the snconfig rdma-dev output, as shown below:

      =======           RDMA Device Info         =======
      IBDEV      NETDEV      LINKSTATUS      CARRIERSTATUS   SPEED       MTU
      mlx5_0     ib0         Down            NA              NA          4092
      mlx5_1     snni0       Up              OK              100.0G      4200

  • Improvements to errors and logs:

    • Error messages now print the process ID to the console.

    • Updated sn.log and snd.log log levels to match the industry standard. Log levels form a hierarchy: setting a log level records messages at that level and at every more severe level. For example, setting the log level to WARNING logs warnings, errors, and critical messages but ignores debug and info messages.

    • When a job is canceled (that is, it stops for a reason other than a failure), the information in sn.log is now clearer.

    • Standardized logging date and time format across sn.log and snd.log for both Red Hat and Ubuntu.
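
The log-level hierarchy described above can be illustrated with Python's standard logging module. This is only an analogy for the threshold behavior, not the SambaNova Runtime logger itself; the logger name and messages are made up for the example.

```python
import io
import logging

# Capture output in memory so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("loglevel_demo")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False  # keep messages out of the root logger

# WARNING is the threshold: anything less severe is dropped.
logger.setLevel(logging.WARNING)

logger.debug("debug detail")      # ignored
logger.info("informational")      # ignored
logger.warning("warning")         # logged
logger.error("error")             # logged
logger.critical("critical")       # logged

emitted = stream.getvalue().splitlines()
print(emitted)
```

Only the WARNING, ERROR, and CRITICAL lines reach the stream, which mirrors the behavior described for sn.log and snd.log.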

Driver support

  • Added RDU driver support for TOSS 4.5.5

  • Added support for BRCM driver 222.0.150.0-1

Documentation improvements

  • Added documentation for the SNML APIs, which you can use for requesting information about RDU status, managing RDUs, querying information about the host, and performing other DataScale system tasks.

Bug fixes

  • Made the program load address 512B aligned.
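
As a brief illustration of what 512B alignment means, the sketch below checks and rounds up addresses to a 512 boundary. It assumes "512B" means 512 bytes; the helper names are hypothetical and are not part of the SambaNova Runtime.

```python
ALIGNMENT = 512  # assumption: "512B" in the note above means 512 bytes


def is_aligned(addr: int, alignment: int = ALIGNMENT) -> bool:
    """Return True if addr is a multiple of alignment."""
    return addr % alignment == 0


def align_up(addr: int, alignment: int = ALIGNMENT) -> int:
    """Round addr up to the next multiple of alignment.

    The bit mask trick requires alignment to be a power of two.
    """
    return (addr + alignment - 1) & ~(alignment - 1)


print(is_aligned(0x1000))       # 0x1000 is a multiple of 512
print(hex(align_up(0x1001)))    # rounds up to the next 512 boundary
```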

Release 1.15 (2023-03-30)

New features and other improvements

  • Added timing information to the description of faults that are manually cleared.

  • Added snconfig options to configure host hugepage settings. Run snconfig set hugepages --help for details.
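
After changing hugepage settings, one way to see what the host reports is to read the hugepage fields from /proc/meminfo on Linux. The sketch below parses those fields from a text sample; the function name and the sample values are made up for illustration, and it does not call snconfig.

```python
def parse_hugepages(meminfo_text: str) -> dict:
    """Extract the hugepage-related fields from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and (key.startswith("HugePages_") or key == "Hugepagesize"):
            # Keep the leading number; Hugepagesize also carries a "kB" unit.
            fields[key] = int(value.split()[0])
    return fields


# Illustrative sample only; these are not values from a real system.
SAMPLE = """\
MemTotal:       131072000 kB
HugePages_Total:    1024
HugePages_Free:     1000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""

info = parse_hugepages(SAMPLE)
print(info)
```

On a real host, the same function can be applied to the contents of /proc/meminfo, for example with open("/proc/meminfo") as f: parse_hugepages(f.read()).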

Fault management improvements

  • Improved error and fault handling policies and descriptions for RDU tiles, PCIe links and device memory.

  • Enhanced error messages in case of resource allocation failure.

  • Tile reset status is now recorded in the fault management error log. Use the snfadm tool to access the SNFM logs.

  • PCIe link faults now include a list of components in the link connectivity path that might be causing the link issues.

Bug fixes

  • Improved device memory initialization and recycling.

  • Fixed bugs in RDU tile resource management and allocation.

Supported components and versions

XRDU firmware

For XRDU firmware information, see the Hardware release notes.

Operating systems

  • Red Hat Enterprise Linux 8.5.

  • Ubuntu Linux 20.04 LTS.

Deprecated components

  • FTYPE_BAD_TILE tile fault has been renamed to FTYPE_TILE_EXCLUDED.

  • FTYPE_BAD_RDU is deprecated. Instead, you see FTYPE_RDU_INIT, FTYPE_RDU_RESET, or FTYPE_RDU_REMOVED, depending upon the reason for the RDU fault.

  • RampUpErrorException: Starting with release 1.16, this exception is no longer one of the SambaNova Runtime exceptions.
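
If you have scripts that match on fault type strings, the renames above could be handled with a small lookup like the following. This is a sketch under assumptions: the function and table names are hypothetical, and only the FTYPE_* identifiers come from these notes.

```python
# One-to-one rename listed in the notes above.
RENAMED = {"FTYPE_BAD_TILE": "FTYPE_TILE_EXCLUDED"}

# FTYPE_BAD_RDU is replaced by one of several more specific fault types.
SPLIT = {
    "FTYPE_BAD_RDU": ("FTYPE_RDU_INIT", "FTYPE_RDU_RESET", "FTYPE_RDU_REMOVED"),
}


def current_fault_types(name: str) -> tuple:
    """Return the fault type name(s) to look for in place of a given name."""
    if name in RENAMED:
        return (RENAMED[name],)
    if name in SPLIT:
        return SPLIT[name]
    return (name,)  # not affected by the changes listed above


print(current_fault_types("FTYPE_BAD_TILE"))
print(current_fault_types("FTYPE_BAD_RDU"))
```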

Release 1.14 (2023-01-10)

New features and other improvements

  • Configuration of RoCE before running any Data Parallel workloads is no longer required.

  • Enhanced fault management for both inventory and fault reporting.

Supported components and versions (unchanged)

  • XRDU firmware

    • OpenBMC version 1.4.1.

    • XRDU version 2.5.0.

  • Operating systems

    • Red Hat Enterprise Linux 8.5.

    • Ubuntu Linux 20.04 LTS.

Deprecated components

  • RampUpErrorException: Starting with release 1.15, this exception is no longer defined as one of the SambaNova Runtime hardware error exceptions.

Release 1.13 (2022-11-03)

New features and other improvements

  • New features

    • Enabled RDU reset from VM.

    • Improved fault management policies for PCIe link errors and on-chip memory errors.

    • Added option to sntilestat to skip idle tiles.

    • Updated SambaNova Runtime APIs for equal functionality between C and C++ interfaces.

    • Enhanced multi-processing support for SambaNova Runtime APIs.

    • Enhanced host profiling information and detailed timeline view in SambaTune.

    • Enhanced snprof and added more robust fault reporting in snstat.

    • Added requirement to properly configure RoCE before running any DP workloads.

  • Fault management improvements

    • Enhanced fault management for both inventory and fault reporting.

  • Infrastructure

    • Improved packaging and simplified software dependencies.

  • Performance improvements

    • Faster SambaFlow context creation.

    • More efficient CPU usage.

    • Better performance for scaleout operations.

Bug fixes

  • Improved robustness of the RDU initialization process in SND.

  • Enhanced local fabric enumeration for inter-RDU communications.

  • Improved robustness of scaleout operations (all reduce, all gather).

Supported components and versions

  • XRDU Firmware

    • OpenBMC version 1.4.1.

    • XRDU version 2.5.0.

  • Operating Systems

    • Red Hat Enterprise Linux 8.5.

    • Ubuntu Linux 20.04 LTS.

Release 1.12.7 (2022-07-30)

New features

  • Added SambaTune: a tool that supports profiling application performance.

  • Enabled C++ SambaRuntime Tensors ZeroCopy via pinned memory.

  • Improved scale-out performance through parallel reduce.

  • Enhanced RDU reset support with VM.

Supported components and versions

Operating Systems

  • Red Hat Enterprise Linux 8.5

  • Ubuntu Linux 20.04 LTS

Software

  • Updated PEF to version 2.0.0. Models must be recompiled to be used with this release due to the PEF version change.

  • Version 2 of the SambaFlow compiler scheduler, specified with the --mac-v2 option, is now the default. The --mac-v1 option continues to be supported but must be specified explicitly.

Deprecated components

  • The global virtual environment under /opt/sambaflow/venv is deprecated and will be removed in version 1.13. It will be replaced by individual virtual environments for each model.