Performance insights report (Beta)

The performance insights report highlights bottlenecks: whether the model is bound by the Host or the RDU, which sections are critical, and which stages are critical. It also suggests first steps for a deeper analysis of each issue.

Locate the report

The performance insights report is available in JSON format at the following location in your output folder: /reports/analysis/summary.json.
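
To inspect the report programmatically, it can be loaded like any other JSON file. The following is a minimal sketch, not part of SambaTune: the helper name `load_insights_report` and the `output_dir` argument are hypothetical, and the only detail taken from this page is the relative path reports/analysis/summary.json.

```python
import json
from pathlib import Path

# Relative location of the report inside the output folder, as stated above.
REPORT_RELATIVE_PATH = Path("reports") / "analysis" / "summary.json"


def load_insights_report(output_dir):
    """Load the performance insights report from an output folder.

    Hypothetical helper for illustration; `output_dir` is whatever output
    folder your run wrote to.
    """
    report_path = Path(output_dir) / REPORT_RELATIVE_PATH
    with report_path.open(encoding="utf-8") as report_file:
        return json.load(report_file)
```

The function returns a plain dictionary, so the top-level parts of the report can be read as `report["Overview"]`, `report["Host"]`, and `report["RDU"]`.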

Read the report

The following is an example report:

{
    "Overview": [
        "Host takes 66.5 % of the total latency.",
        "RDU takes 33.5 % of the total latency.",
        "This model could be Host bound."
    ],
    "Host": [
        {
            "Top 3 time-consuming processes": [
                "SAMBA takes 42.2 % of the host latency. You may take a look at PYINSTRUMENT DATA tab in SambaTune UI OVERVIEW page.",
                "XFER takes 35.6 % of the host latency. You may check tensors in <path_to_report>.",
                "RUN SETUP takes 13.3 % of the host latency."
            ]
        }
    ],
    "RDU": [
        {
            "chip 0, section 2 takes 56.3 % of the RDU latency. The total DDR bandwidth is 95.41 GB/s.": {
                "Analysis": [
                    "Detailed information can be found in <path_to_report>."
                ],
                "Top 3 time-consuming stages": [
                    "Detailed information can be found in <path_to_report>."
                ]
            }
        },
        {
            "chip 0, section 1 takes 28.8 % of the RDU latency. The total DDR bandwidth is 88.92 GB/s.": {
                "Analysis": [
                    "Detailed information can be found in <path_to_report>."
                ],
                "Top 3 time-consuming stages": [
                    "stage 140 takes 19.83 % of the section latency. Template names are dlrm__top_mlp__2__linear_bwd_loss_grad_b. Node names are tlir.Buffer1228, tlir.Linear1558, tlir.Buffer1235.",
                    "stage 132 takes 14.28 % of the section latency. Template names are dlrm__top_mlp__6__linear_bwd_loss_grad_b. Node names are tlir.Buffer1210, tlir.Linear1548, tlir.Buffer1217.",
                    "stage 136 takes 5.83 % of the section latency. Template names are dlrm__top_mlp__4__linear_bwd_loss_grad_b. Node names are tlir.Buffer1219, tlir.Linear1553, tlir.Buffer1226.",
                    "Detailed information can be found in <path_to_report>."
                ]
            }
        },
        {
            "chip 0, section 0 takes 15.0 % of the RDU latency. The total DDR bandwidth is 93.36 GB/s.": {
                "Analysis": [
                    "Detailed information can be found in <path_to_report>."
                ],
                "Top 3 time-consuming stages": [
                    "stage 18 takes 3.72 % of the section latency. Node names are tlir.Buffer950, tlir.Buffer967.",
                    "stage 3 takes 3.72 % of the section latency. Node names are tlir.Buffer965, tlir.Buffer967.",
                    "stage 25 takes 3.72 % of the section latency. Node names are tlir.Buffer943, tlir.Buffer967.",
                    "Detailed information can be found in <path_to_report>."
                ]
            }
        }
    ]
}

The report is structured into three parts: Overview, Host, and RDU.

  • The Overview presents the percentage of total latency consumed by the Host and by the RDU, and identifies the likely bottleneck (for example, "This model could be Host bound.").

  • The Host section presents the top 3 time-consuming processes, each expressed as a percentage of the total Host latency. Where applicable, it suggests a next step, such as reviewing the pyinstrument data in the SambaTune UI (pyinstrument is a call-stack profiler for Python) or checking the tensor data.

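  • The RDU section presents the top 3 time-consuming sections, each expressed as a percentage of the total RDU latency, together with the total DDR bandwidth of each section. For each section, it shows an analysis and the top 3 time-consuming stages, each expressed as a percentage of the section latency, along with their template names and node names.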
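
The three-part structure described above can be walked with a short script. This is a sketch under stated assumptions: the helper names `parse_share` and `collect_shares` are hypothetical, and the sentence pattern ("<name> takes <number> % of the <scope> latency.") is inferred only from the example report on this page. Because the report is in Beta, the wording may change, so sentences that do not match are skipped rather than treated as errors.

```python
import re

# Assumed sentence shape, inferred from the example report only:
# "<name> takes <number> % of the <scope> latency."
SHARE_RE = re.compile(
    r"^(?P<name>.+?) takes (?P<pct>\d+(?:\.\d+)?) % of the (?P<scope>.+?) latency\."
)


def parse_share(sentence):
    """Return (name, percent, scope) for a matching sentence, else None."""
    match = SHARE_RE.match(sentence)
    if match is None:
        return None
    return match.group("name"), float(match.group("pct")), match.group("scope")


def collect_shares(report):
    """Flatten a report into (part, name, percent, scope) tuples."""
    rows = []
    for sentence in report.get("Overview", []):
        parsed = parse_share(sentence)
        if parsed:
            rows.append(("Overview", *parsed))
    for entry in report.get("Host", []):
        for sentence in entry.get("Top 3 time-consuming processes", []):
            parsed = parse_share(sentence)
            if parsed:
                rows.append(("Host", *parsed))
    for entry in report.get("RDU", []):
        # Each RDU entry is keyed by a sentence that describes the section.
        for heading, details in entry.items():
            parsed = parse_share(heading)
            if parsed:
                rows.append(("RDU", *parsed))
            for sentence in details.get("Top 3 time-consuming stages", []):
                parsed = parse_share(sentence)
                if parsed:
                    rows.append(("RDU", *parsed))
    return rows


# Trimmed copy of the example report shown above.
sample_report = {
    "Overview": [
        "Host takes 66.5 % of the total latency.",
        "RDU takes 33.5 % of the total latency.",
        "This model could be Host bound.",
    ],
    "Host": [
        {
            "Top 3 time-consuming processes": [
                "SAMBA takes 42.2 % of the host latency. You may take a look at PYINSTRUMENT DATA tab in SambaTune UI OVERVIEW page.",
                "RUN SETUP takes 13.3 % of the host latency.",
            ]
        }
    ],
    "RDU": [
        {
            "chip 0, section 0 takes 15.0 % of the RDU latency. The total DDR bandwidth is 93.36 GB/s.": {
                "Analysis": ["Detailed information can be found in <path_to_report>."],
                "Top 3 time-consuming stages": [
                    "stage 18 takes 3.72 % of the section latency. Node names are tlir.Buffer950, tlir.Buffer967.",
                    "Detailed information can be found in <path_to_report>.",
                ],
            }
        }
    ],
}

for row in collect_shares(sample_report):
    print(row)
```

Each tuple keeps the scope ("total", "host", "RDU", or "section") next to the percentage, because the percentages in different parts of the report are relative to different totals and should not be compared directly.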

View the report

The SambaTune UI displays the insights at different levels. Here are examples of performance insights in the SambaTune UI:

Performance insights at the Overview level
Performance insights at the Host level
Performance insights at the RDU level
Performance insights at the section level