SambaNova Runtime release notes

Release 1.17 (2023-10-31)

New features

Error handling improvements

  • Added link health monitoring for the PCIe connection between the x86 host and the XRDU, performed from the host side (HIC port), on all platforms. Errors on the host PCIe link to each RDU are now monitored.

  • SambaNova Runtime RAS Support: Enabled synchronous error handling support for memory UEs (uncorrectable errors) on SN20 and SN30. Graph execution failure will be detected sooner, resulting in faster remediation.

Documentation improvements

Release 1.16 (2023-07-14)

New features and other improvements

  • Improved power management stability on SN30.

  • Enabled RDU-level reset on SN30 systems and standardized the RDU reset log messages for success and failure.

  • Improvements to snconfig:

    • New command for displaying RDU driver operational state. For example: snconfig show Node op-state

    • Added MTU to the snconfig rdma-dev output, as shown below:

      =======           RDMA Device Info         =======
      IBDEV      NETDEV      LINKSTATUS      CARRIERSTATUS   SPEED       MTU
      mlx5_0     ib0         Down            NA              NA          4092
      mlx5_1     snni0       Up              OK              100.0G      4200

  • Improvements to errors and logs:

    • Error messages printed to the console now include the process ID.

    • Updated sn.log and snd.log log levels to match the industry standard. Log levels form a hierarchy where a higher log level includes all the lower levels. For example, setting the log level to WARNING logs warnings, errors, and critical messages but ignores debug and info messages.

    • When a job is canceled (that is, it stops for a reason other than a failure), the information in sn.log is now clearer.

    • Standardized logging date and time format across sn.log and snd.log for both Red Hat and Ubuntu.
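
The log level hierarchy described above can be illustrated with Python's standard logging module. This is only a sketch of the general convention, not the SambaNova Runtime implementation; the logger name and the ListHandler helper are hypothetical and exist only for this example.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical helper: records the level name of every emitted message."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def emit(self, record):
        self.seen.append(record.levelname)


def levels_logged_at(threshold):
    """Return the level names that get through when the logger is set to threshold."""
    logger = logging.getLogger(f"demo.{threshold}")  # hypothetical logger name
    logger.setLevel(threshold)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        # Emit one message at each standard level, lowest to highest severity.
        logger.debug("debug message")
        logger.info("info message")
        logger.warning("warning message")
        logger.error("error message")
        logger.critical("critical message")
    finally:
        logger.removeHandler(handler)
    return handler.seen


# With the threshold at WARNING, debug and info messages are dropped.
print(levels_logged_at(logging.WARNING))  # ['WARNING', 'ERROR', 'CRITICAL']
```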

Driver support

  • Added RDU driver support for TOSS 4.5.5

  • Added support for BRCM driver 222.0.150.0-1

Documentation improvements

  • Now documenting the SNML APIs, which you can use for requesting information about RDU status, managing RDUs, querying information about the host, and performing other DataScale system tasks.

Bug fixes

  • The program load address is now 512-byte aligned.
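
As a generic illustration of what 512-byte alignment means, the following sketch checks an address and rounds it up to the next 512-byte boundary. It does not reflect how the runtime computes load addresses; the function names and the sample address are assumptions made for this example.

```python
ALIGNMENT = 512  # bytes; must be a power of two for the bit mask below to work


def is_aligned(address, alignment=ALIGNMENT):
    """Return True if address is a multiple of alignment."""
    return address % alignment == 0


def align_up(address, alignment=ALIGNMENT):
    """Round address up to the next multiple of alignment (a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)


sample = 0x1234  # hypothetical, unaligned address
print(is_aligned(sample))     # False
print(hex(align_up(sample)))  # 0x1400
```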

Release 1.15 (2023-03-30)

New features and other improvements

  • Added timing information to description for faults that are manually cleared.

  • Added snconfig options to configure host hugepage settings. Run snconfig set hugepages --help for details.
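
To confirm the resulting host hugepage settings independently of snconfig, one option is to read the standard Linux /proc/meminfo fields. The sketch below parses those fields from text; the function name and the sample values are assumptions for illustration, and the snconfig output format is not shown here.

```python
def parse_hugepage_info(meminfo_text):
    """Extract hugepage-related fields from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and (key.startswith("HugePages_") or key == "Hugepagesize"):
            # The first token is the number; Hugepagesize is followed by "kB".
            fields[key] = int(value.split()[0])
    return fields


# Made-up sample in the /proc/meminfo format; on a Linux host, pass the
# contents of /proc/meminfo instead.
SAMPLE = """\
MemTotal:       263856124 kB
HugePages_Total:    1024
HugePages_Free:     1000
Hugepagesize:       2048 kB
"""

info = parse_hugepage_info(SAMPLE)
print(info["HugePages_Total"], info["Hugepagesize"])  # 1024 2048
```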

Fault management improvements

  • Improved error and fault handling policies and descriptions for RDU tiles, PCIe links and device memory.

  • Enhanced error messages in case of resource allocation failure.

  • Tile reset status is now recorded in the fault management error log. Use the snfadm tool to access the SNFM logs.

  • PCIe link faults now include a list of components in the link connectivity path that might be causing the link issue.

Bug fixes

  • Improved device memory initialization and recycling

  • Fixed bugs in RDU tile resource management and allocation.

Supported components and versions

XRDU firmware

For XRDU firmware information, see the Hardware release notes.

Operating systems

  • Red Hat Enterprise Linux 8.5.

  • Ubuntu Linux 20.04 LTS.

Deprecated components

  • FTYPE_BAD_TILE tile fault has been renamed to FTYPE_TILE_EXCLUDED.

  • FTYPE_BAD_RDU is deprecated. Instead, you see FTYPE_RDU_INIT, FTYPE_RDU_RESET, or FTYPE_RDU_REMOVED, depending upon the reason for the RDU fault.

  • RampUpErrorException: Starting with release 1.16, this exception is no longer one of the SambaNova Runtime exceptions.

Release 1.14 (2023-01-10)

New features and other improvements

  • Configuration of RoCE before running any Data Parallel workloads is no longer required.

  • Enhanced fault management for both inventory and fault reporting.

Supported components and versions (unchanged)

  • XRDU firmware

    • OpenBMC version 1.4.1.

    • XRDU version 2.5.0.

  • Operating systems

    • Red Hat Enterprise Linux 8.5.

    • Ubuntu Linux 20.04 LTS.

Deprecated components

  • RampUpErrorException: Starting with release 1.15, this exception is no longer defined as one of the SambaNova Runtime hardware error exceptions.

Release 1.13 (2022-11-03)

New features and other improvements

  • New features

    • Enabled RDU reset from VM.

    • Improved fault management policies for PCIe link errors and on-chip memory errors.

    • Added option to sntilestat to skip idle tiles.

    • Updated SambaNova Runtime APIs for equal functionality between C and C++ interfaces.

    • Enhanced multi-processing support for SambaNova Runtime APIs.

    • Enhanced host profiling information and detailed timeline view in SambaTune.

    • Enhanced snprof and added more robust fault reporting in snstat.

    • Added requirement to properly configure RoCE before running any Data Parallel (DP) workloads.

  • Fault management improvements

    • Enhanced fault management for both inventory and fault reporting.

  • Infrastructure

    • Improved packaging and simplified software dependencies.

  • Performance improvements

    • Faster SambaFlow context creation.

    • More efficient CPU usage.

    • Better performance for scaleout operations.

Bug fixes

  • Improved robustness of the RDU initialization process in SND.

  • Enhanced local fabric enumeration for inter-RDU communications.

  • Improved robustness of scaleout operations (all reduce, all gather).

Supported components and versions

  • XRDU Firmware

    • OpenBMC version 1.4.1.

    • XRDU version 2.5.0.

  • Operating Systems

    • Red Hat Enterprise Linux 8.5.

    • Ubuntu Linux 20.04 LTS.

Release 1.12.7 (2022-07-30)

New features

  • Added SambaTune: a tool that supports profiling application performance.

  • Enabled C++ SambaRuntime Tensors ZeroCopy via pinned memory.

  • Improved scale-out performance through parallel reduce.

  • Enhanced RDU reset support with VM.

Supported components and versions

Operating Systems

  • Red Hat Enterprise Linux 8.5

  • Ubuntu Linux 20.04 LTS

Software

  • Updated PEF to version 2.0.0. Models must be recompiled to be used with this release due to the PEF version change.

  • Version 2 of the SambaFlow compiler scheduler, specified with the --mac-v2 option, is now the default. The --mac-v1 option continues to be supported but must be specified explicitly.

Deprecated components

  • The global virtual environment under /opt/sambaflow/venv is deprecated and will be removed in version 1.13. It will be replaced by individual virtual environments for each model.