SambaNova Runtime release notes
Release 1.17
New features
-
Enhanced, expanded, and renamed several SNML APIs. Updated documentation in SNML API reference (all users) and SNML API reference (admin users).
Error handling improvements
-
Added health monitoring, from the host side (HIC port), of the PCIe link between the x86 host and the XRDU on all platforms. Any error on the host PCIe link to each RDU is now monitored.
  -
  RDU-related faults are visible through snfadm.
  -
  Host PCIe-related faults are logged.
-
SambaNova Runtime RAS Support: Enabled synchronous error handling support for memory UEs (uncorrectable errors) on SN20 and SN30. Graph execution failure will be detected sooner, resulting in faster remediation.
Documentation improvements
-
Additions to SNML API reference (all users) and SNML API reference (admin users).
-
Additions and corrections in How SambaNova developers can use Slurm.
Release 1.16
New features and other improvements
-
Improved power management stability on SN30.
-
Enabled RDU-level reset on SN30 systems and standardized RDU reset log message for success or failure.
-
Improvements to snconfig:
  -
  New command for displaying the RDU driver operational state. For example:
  snconfig show Node op-state
  -
  Added MTU to the snconfig rdma-dev output, as shown below:
  ======= RDMA Device Info =======
  IBDEV   NETDEV  LINKSTATUS  CARRIERSTATUS  SPEED   MTU
  mlx5_0  ib0     Down        NA             NA      4092
  mlx5_1  snni0   Up          OK             100.0G  4200
-
Improvements to errors and logs
  -
  For Error messages, the process ID is now printed to the console.
  -
  Updated sn.log and snd.log log levels to match the industry standard. Log levels form a severity hierarchy: setting a log level records messages at that level and at every more severe level. For example, setting the log level to WARNING logs warnings, errors, and critical messages but ignores debug and info messages.
  -
  When a job is canceled (that is, it stops for a reason other than a failure), the information in sn.log is now clearer.
  -
  Standardized the logging date and time format across sn.log and snd.log for both Red Hat and Ubuntu.
-
Documentation improvements
-
Now documenting the SNML APIs, which you can use for requesting information about RDU status, managing RDUs, querying information about the host, and performing other DataScale system tasks.
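The log-level hierarchy described for sn.log and snd.log in this release can be illustrated with a short sketch. It uses Python's standard logging module, not the SambaNova logging implementation, and assumes the two use comparable level names (debug, info, warning, error, critical). The function name and logger names are hypothetical.

```python
import io
import logging


def levels_logged_at(threshold: int) -> list[str]:
    """Return the names of the levels that pass a logger set to `threshold`."""
    stream = io.StringIO()
    # Hypothetical logger name, used only for this illustration.
    logger = logging.getLogger(f"loglevel_demo.{threshold}")
    logger.setLevel(threshold)
    logger.propagate = False  # keep demo output away from the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))
    logger.addHandler(handler)
    try:
        # Emit one message at each standard level, least to most severe.
        logger.debug("debug message")
        logger.info("info message")
        logger.warning("warning message")
        logger.error("error message")
        logger.critical("critical message")
    finally:
        logger.removeHandler(handler)
    return stream.getvalue().split()


if __name__ == "__main__":
    # With the threshold at WARNING, debug and info messages are dropped.
    print(levels_logged_at(logging.WARNING))
```

Setting the threshold to WARNING keeps the warning, error, and critical messages, which matches the example given in the note above.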
Release 1.15
New features and other improvements
-
Added timing information to description for faults that are manually cleared.
-
Added snconfig options to configure host hugepage settings. Run snconfig set hugepages --help for details.
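After changing hugepage settings, it can be useful to confirm what the Linux kernel reports. The sketch below is a generic Linux check, not part of snconfig: it parses the hugepage fields from text in the format of /proc/meminfo. The function name and the sample values are assumptions made for illustration.

```python
from pathlib import Path


def parse_hugepages(meminfo_text: str) -> dict[str, int]:
    """Extract the hugepage-related fields from /proc/meminfo-style text."""
    fields: dict[str, int] = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # skip lines that are not "Key: value" pairs
        if key.startswith("HugePages_") or key == "Hugepagesize":
            # Hugepagesize is reported in kB; the HugePages_* fields are page counts.
            fields[key] = int(parts[0])
    return fields


# Illustrative sample only; real values depend on the host configuration.
SAMPLE = """MemTotal:       527951872 kB
HugePages_Total:    1024
HugePages_Free:     1000
Hugepagesize:       2048 kB
"""

if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    text = meminfo.read_text() if meminfo.exists() else SAMPLE
    print(parse_hugepages(text))
```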
Fault management improvements
-
Improved error and fault handling policies and descriptions for RDU tiles, PCIe links and device memory.
-
Enhanced error messages in case of resource allocation failure.
-
Tile reset status is recorded in the fault management Error log. Use the snfadm tool to access the SNFM logs.
-
PCIe link faults now include a list of components in the link connectivity path that might be causing the link issue.
Bug fixes
-
Improved device memory initialization and recycling.
-
Fixed bugs in RDU tile resource management and allocation.
Supported components and versions
XRDU firmware
For XRDU firmware information, see the Hardware release notes.
Operating systems
-
Red Hat Enterprise Linux 8.5.
-
Ubuntu Linux 20.04 LTS.
Deprecated components
-
FTYPE_BAD_TILE tile fault has been renamed to FTYPE_TILE_EXCLUDED.
-
FTYPE_BAD_RDU is deprecated. Instead, you see FTYPE_RDU_INIT, FTYPE_RDU_RESET, or FTYPE_RDU_REMOVED, depending upon the reason for the RDU fault.
-
RampUpErrorException: Starting with release 1.16, this exception is no longer one of the SambaNova Runtime exceptions.
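Scripts that filter fault logs by fault type name may need updating for the renames listed above. The following is a minimal sketch of such a mapping. The fault type names come from these notes, while the helper and table names are hypothetical and not part of any SambaNova API.

```python
# One-to-one rename, as listed in the notes above.
RENAMED = {"FTYPE_BAD_TILE": "FTYPE_TILE_EXCLUDED"}

# Deprecated type replaced by several more specific types.
SPLIT = {
    "FTYPE_BAD_RDU": ("FTYPE_RDU_INIT", "FTYPE_RDU_RESET", "FTYPE_RDU_REMOVED"),
}


def current_fault_types(name: str) -> tuple[str, ...]:
    """Map a possibly outdated fault type name to the name(s) now reported."""
    if name in RENAMED:
        return (RENAMED[name],)
    if name in SPLIT:
        return SPLIT[name]
    return (name,)  # unknown or unchanged names pass through as-is


if __name__ == "__main__":
    for old_name in ("FTYPE_BAD_TILE", "FTYPE_BAD_RDU"):
        print(old_name, "->", ", ".join(current_fault_types(old_name)))
```

A filter that previously matched FTYPE_BAD_RDU would need to match any of the three replacement types, because the notes do not give a single successor.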