TMA (Tensor Memory Accelerator) is a new feature introduced in the NVIDIA Hopper™ architecture for performing asynchronous memory copies between a GPU's global memory and shared memory. Hopper's H100 Tensor Core GPU follows earlier Tensor Core generations such as the Tesla V100 GPU and the A100 Tensor Core GPU.
TMA was introduced in the Hopper architecture. On the host, a descriptor handles the creation of the tensor map using the cuTensorMapEncode API. (Hopper also adds warpgroup-level, 128-thread, matrix-multiply PTX instructions; operand A can come from shared memory or registers, operand B from shared memory, and transposed f16 operands are supported.)
[Figure: TMA overview diagram, modified from NVIDIA's H100 white paper.]
The Tensor Memory Accelerator (TMA) is a hardware unit, exposed as a set of instructions, introduced in the NVIDIA Hopper architecture (SM90+) for copying possibly multidimensional arrays between global and shared memory. It builds on top of the asynchronous copies introduced by the NVIDIA Ampere GPU architecture and provides a more sophisticated form of asynchronous bulk copy. To build the tensor map, we first create a TMA descriptor on the CPU, as sketched below.
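Concretely, the host-side step might look like the following minimal sketch. It assumes a row-major 2D float matrix already allocated in global memory; the names (`make_tensor_map`, `d_matrix`, the tile shape) are illustrative placeholders, while `cuTensorMapEncodeTiled` and its enums come from the CUDA driver API (CUDA 12+).

```cpp
#include <cuda.h>     // CUDA driver API: CUtensorMap, cuTensorMapEncodeTiled
#include <cstdint>

// Hypothetical helper: encode a tensor map for a GMEM_H x GMEM_W row-major
// float matrix, copied in SMEM_H x SMEM_W tiles.
CUtensorMap make_tensor_map(void* d_matrix,
                            uint64_t GMEM_H, uint64_t GMEM_W,
                            uint32_t SMEM_H, uint32_t SMEM_W) {
  CUtensorMap tensor_map{};
  // Dimensions are listed fastest-moving first.
  uint64_t global_dim[2]     = {GMEM_W, GMEM_H};
  // Stride in bytes between rows; the innermost stride is implicit.
  uint64_t global_strides[1] = {GMEM_W * sizeof(float)};
  // Shape of the box (tile) that each TMA copy moves to shared memory.
  uint32_t box_dim[2]        = {SMEM_W, SMEM_H};
  uint32_t elem_strides[2]   = {1, 1};

  CUresult res = cuTensorMapEncodeTiled(
      &tensor_map,
      CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
      /*tensorRank=*/2,
      d_matrix,
      global_dim,
      global_strides,
      box_dim,
      elem_strides,
      CU_TENSOR_MAP_INTERLEAVE_NONE,
      CU_TENSOR_MAP_SWIZZLE_NONE,
      CU_TENSOR_MAP_L2_PROMOTION_NONE,
      CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
  // Real code should check res == CUDA_SUCCESS.
  return tensor_map;
}
```

The resulting `CUtensorMap` is then passed to the kernel, typically by value as a `const __grid_constant__` parameter, so the device side never has to recompute addresses or strides.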
TMA loads data from global memory (GPU RAM) to shared memory (the L1 data cache), bypassing the register file entirely, which makes bulk data transfers efficient: a single thread can issue the copy of a whole tile while the rest of the block does other work.
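On the device, one thread issues the bulk copy and every thread then waits on a shared-memory barrier. The following is a minimal sketch in the style of the CUDA Programming Guide's TMA example, using the experimental libcu++ interface (`cuda::device::experimental`, CUDA 12.x, compiled for sm_90a); the tile shape and the tile coordinates `x0`, `y0` are illustrative.

```cpp
#include <cuda/barrier>
using barrier = cuda::barrier<cuda::thread_scope_block>;
namespace cde = cuda::device::experimental;

constexpr int SMEM_H = 64, SMEM_W = 64;  // illustrative tile shape

__global__ void tma_load_kernel(const __grid_constant__ CUtensorMap tensor_map,
                                int x0, int y0) {
  // Destination buffer; TMA requires 128-byte alignment.
  __shared__ alignas(128) float smem[SMEM_H][SMEM_W];
#pragma nv_diag_suppress static_var_with_dynamic_init
  __shared__ barrier bar;

  if (threadIdx.x == 0) {
    init(&bar, blockDim.x);               // one arrival per thread
    cde::fence_proxy_async_shared_cta();  // make the barrier visible to TMA
  }
  __syncthreads();

  barrier::arrival_token token;
  if (threadIdx.x == 0) {
    // A single thread issues the bulk tensor copy; the TMA unit does the rest.
    cde::cp_async_bulk_tensor_2d_global_to_shared(&smem, &tensor_map,
                                                  x0, y0, bar);
    // Arrive and tell the barrier how many bytes the copy will deposit.
    token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(smem));
  } else {
    token = bar.arrive();
  }
  bar.wait(std::move(token));  // data is now in shared memory

  // ... compute on smem; no registers were used for the transfer itself ...
}
```

Note that only thread 0 touches the copy; the barrier's transaction count (`barrier_arrive_tx`) is how the hardware signals completion of the asynchronous transfer to the waiting threads.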