Learning the GGML Library

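I am about to do some whisper-related work, and whisper.cpp is built on GGML, so I am recording the commonly used APIs and the basic architecture here. GGML is a small machine learning library focused on inference for Transformer-architecture models.

The following notes on some of its objects are taken from the Hugging Face blog: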

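- ggml_context: a container that holds the various objects (tensors, graphs, and so on).
- ggml_cgraph: the representation of a compute graph; think of it as the order of computation that will be handed to the backend.
- ggml_backend: the interface that executes compute graphs. There are several kinds: CPU, CUDA, Metal, Vulkan, RPC, and others.
- ggml_backend_buffer_type: represents a kind of buffer; think of it as a memory allocator attached to each ggml_backend. To run a computation on the GPU, memory has to be allocated on the GPU through a buffer_type.
- ggml_backend_buffer: a buffer allocated through a buffer_type. One buffer can hold the data of multiple tensors.
- ggml_gallocr: a graph memory allocator that allocates memory efficiently for the tensors in a compute graph.
- ggml_backend_sched: a scheduler that lets multiple backends be used concurrently. When working with large models or running inference on multiple GPUs, it distributes the computation across hardware (for example, hybrid CPU+GPU computation). It can also move operators that the GPU does not support over to the CPU.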

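Basic steps for using ggml with the backend hard-coded to CPU

1. Allocate a ggml_context to store the tensor data
2. Create the tensors and set their data
3. Create a ggml_cgraph for the matrix multiplication
4. Run the computation
5. Retrieve the result
6. Free the memory and exit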
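
Step 1 works because a ggml_context hands out tensors and their data from one block whose size is decided up front, rather than calling malloc per object. The sketch below illustrates that general technique (a bump, or arena, allocator) in plain C. It is not ggml's implementation: the `arena` type, the `arena_*` functions, and the 16-byte alignment are all hypothetical names and values chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bump ("arena") allocator; illustrative only, not ggml code. */
#define ARENA_ALIGN 16 /* assumed alignment, chosen for this sketch */

typedef struct {
    uint8_t *base; /* start of the single pre-sized block */
    size_t   size; /* total bytes in the block */
    size_t   used; /* bytes handed out so far, padding included */
} arena;

/* Round n up to the next multiple of ARENA_ALIGN. */
size_t arena_pad(size_t n) {
    return (n + ARENA_ALIGN - 1) & ~(size_t)(ARENA_ALIGN - 1);
}

/* Reserve the whole block once; returns 1 on success, 0 on failure. */
int arena_init(arena *a, size_t size) {
    assert(a != NULL);
    a->base = (uint8_t *) malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

/* Carve n bytes (plus padding) out of the block; NULL when it does not fit. */
void *arena_alloc(arena *a, size_t n) {
    size_t padded = arena_pad(n);
    if (padded > a->size - a->used) {
        return NULL;
    }
    void *p = a->base + a->used;
    a->used += padded;
    return p;
}

/* Release everything at once, which is why individual objects are never freed. */
void arena_free(arena *a) {
    free(a->base);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}
```

Because every request is padded, the block has to be somewhat larger than the sum of the raw sizes. That is the kind of slack an extra margin on top of a hand-computed context size is meant to absorb.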

Walkthrough of the example from the article

#include "ggml.h"
#include "ggml-cpu.h"
#include <string.h>
#include <stdio.h>

int main(void) {
    // initialize data of matrices to perform matrix multiplication
    enum { rows_A = 4, cols_A = 2 }; // enum constants, so the array size below is a compile-time constant in C
    float matrix_A[rows_A * cols_A] = {
        2, 8,
        5, 1,
        4, 2,
        8, 6
    };
    enum { rows_B = 3, cols_B = 2 }; // enum constants, so the array size below is a compile-time constant in C
    float matrix_B[rows_B * cols_B] = {
        10, 5,
        9, 9,
        5, 4
    };

    // 1. Allocate `ggml_context` to store tensor data
    // Calculate the size needed to allocate
    // ggml does not allocate each object with new/malloc; it sizes one large block up front
    // and places every struct ggml_tensor and its tensor->data inside that block
    size_t ctx_size = 0;
    // ggml_type_size(GGML_TYPE_F32) returns sizeof(float), normally 4 bytes
    ctx_size += rows_A * cols_A * ggml_type_size(GGML_TYPE_F32); // tensor a
    ctx_size += rows_B * cols_B * ggml_type_size(GGML_TYPE_F32); // tensor b
    ctx_size += rows_A * rows_B * ggml_type_size(GGML_TYPE_F32); // result
    // the lines above cover tensor->data; next comes the space for the struct ggml_tensor objects themselves
    ctx_size += 3 * ggml_tensor_overhead(); // metadata for 3 tensors
    // struct ggml_cgraph (the compute graph) is also an object and needs memory
    ctx_size += ggml_graph_overhead(); // compute graph
    // safety margin that leaves room for alignment padding
    ctx_size += 1024; // some overhead (exact calculation omitted for simplicity)

    // Allocate `ggml_context` to store tensor data
    struct ggml_init_params params = {
        /*.mem_size =*/ ctx_size,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc =*/ false,

        // Note on larger programs such as whisper.cpp (my understanding, not checked
        // against the current source): sizes are not added up by hand as done here.
        // Pass 1 (measure):
        //   - create the context with no_alloc = true, so it holds only tensor and
        //     graph metadata; mem_size still has to cover that metadata
        //     (ggml_tensor_overhead() per tensor plus ggml_graph_overhead());
        //   - build the tensors and the graph as usual;
        //   - let the graph allocator measure: ggml_gallocr_reserve() works out the
        //     layout for the graph and ggml_gallocr_get_buffer_size() reports the
        //     buffer size it needs.
        // Pass 2 (run):
        //   - allocate weights in a backend buffer, for example with
        //     ggml_backend_alloc_ctx_tensors();
        //   - allocate graph intermediates with ggml_gallocr_alloc_graph();
        //   - compute the graph.
        // Caveat: with no_alloc = true the context does not account for tensor data,
        // and ggml_get_mem_size(ctx) reports the context's own pool size, so it is
        // not a measurement of the data memory that will be required.
    };
    struct ggml_context * ctx = ggml_init(params);

    // 2. Create tensors and set data
    struct ggml_tensor * tensor_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_A, rows_A);
    struct ggml_tensor * tensor_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_B, rows_B);
    memcpy(tensor_a->data, matrix_A, ggml_nbytes(tensor_a));
    memcpy(tensor_b->data, matrix_B, ggml_nbytes(tensor_b));


    // 3. Create a `ggml_cgraph` for mul_mat operation
    struct ggml_cgraph * gf = ggml_new_graph(ctx);

    // result = a*b^T
    // Pay attention: ggml_mul_mat(A, B) ==> B will be transposed internally
    // the result is transposed
    struct ggml_tensor * result = ggml_mul_mat(ctx, tensor_a, tensor_b);

    // Mark the "result" tensor to be computed
    ggml_build_forward_expand(gf, result);

    // 4. Run the computation
    int n_threads = 1; // Optional: number of threads to perform some operations with multi-threading
    // start the computation
    ggml_graph_compute_with_ctx(ctx, gf, n_threads);

    // 5. Retrieve results (output tensors)
    float * result_data = (float *) result->data;
    printf("mul mat (%d x %d) (transposed result):\n[", (int) result->ne[0], (int) result->ne[1]);
    for (int j = 0; j < result->ne[1]/* rows */; j++) {
        if (j > 0) {
            printf("\n");
        }

        for (int i = 0; i < result->ne[0]/* cols */; i++) {
            printf(" %.2f", result_data[j * result->ne[0] + i]);
        }
    }
    printf(" ]\n");

    // 6. Free memory and exit
    ggml_free(ctx);
    return 0;
}
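
To check what the program prints, it helps to have the same product written out in plain C. The sketch below is a reference I wrote for that purpose, not part of ggml; `mul_mat_ref` and its parameters are hypothetical names. It assumes both inputs are row-major with the same number of columns `k`, and it writes the output in the layout the example prints: `ne[0] = rows_a` values per row and `ne[1] = rows_b` rows.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C reference for out = (A * B^T) stored transposed; hypothetical helper.
 * a:   rows_a x k, row-major
 * b:   rows_b x k, row-major
 * out: rows_b rows of rows_a values, out[j * rows_a + i] = dot(a row i, b row j) */
void mul_mat_ref(const float *a, int rows_a,
                 const float *b, int rows_b,
                 int k, float *out) {
    assert(a != NULL && b != NULL && out != NULL);
    for (int j = 0; j < rows_b; j++) {
        for (int i = 0; i < rows_a; i++) {
            float sum = 0.0f;
            for (int c = 0; c < k; c++) {
                sum += a[i * k + c] * b[j * k + c];
            }
            out[j * rows_a + i] = sum;
        }
    }
}
```

For the matrices in the example, calling `mul_mat_ref(matrix_A, 4, matrix_B, 3, 2, out)` with a 12-element `out` gives the rows `60 55 50 110`, `90 54 54 126`, and `42 29 28 64`. The first entry, for instance, is 2*10 + 8*5 = 60.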

Basic steps for computation or inference on a specified backend

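1. Initialize a ggml_backend
2. Allocate a ggml_context to hold the tensor metadata (the tensor data itself does not need to be allocated yet)
3. Create the tensor metadata (that is, shape and data type)
4. Allocate a ggml_backend_buffer to store all the tensors
5. Copy the actual tensor data from main memory (RAM) into the backend buffer
6. Create a ggml_cgraph for the matrix multiplication
7. Create a ggml_gallocr to allocate memory for the compute graph
8. Optional: schedule the compute graph with ggml_backend_sched
9. Run the compute graph
10. Retrieve the result, that is, the output of the compute graph
11. Free the memory and exit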

Code example using a backend

#include "ggml.h"
#include "ggml-alloc.h"
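#include "ggml-backend.h"
#include "ggml-cpu.h" // declares the CPU backend functions used below (assumes a recent ggml that ships this header)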
#ifdef GGML_USE_CUDA
#include "ggml-cuda.h"
#endif

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    // initialize data of matrices to perform matrix multiplication
    enum { rows_A = 4, cols_A = 2 }; // enum constants, so the array size below is a compile-time constant in C
    float matrix_A[rows_A * cols_A] = {
        2, 8,
        5, 1,
        4, 2,
        8, 6
    };
    enum { rows_B = 3, cols_B = 2 }; // enum constants, so the array size below is a compile-time constant in C
    float matrix_B[rows_B * cols_B] = {
        10, 5,
        9, 9,
        5, 4
    };

    // 1. Initialize backend
    // declare the backend handle (an abstract backend interface)
    ggml_backend_t backend = NULL;
#ifdef GGML_USE_CUDA
    fprintf(stderr, "%s: using CUDA backend\n", __func__);
    backend = ggml_backend_cuda_init(0); // init device 0
    if (!backend) {
        fprintf(stderr, "%s: ggml_backend_cuda_init() failed\n", __func__);
    }
#endif
    // if no GPU backend is available, fall back to the CPU backend
    if (!backend) {
        backend = ggml_backend_cpu_init();
    }

    // Calculate the size needed to allocate
    // tensor data sizes are no longer added up by hand; the context only holds metadata
    size_t ctx_size = 0;
    ctx_size += 2 * ggml_tensor_overhead(); // tensors
    // no need to allocate anything else!

    // 2. Allocate `ggml_context` to store tensor metadata
    struct ggml_init_params params = {
        /*.mem_size =*/ ctx_size,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc =*/ true, // the tensors will be allocated later by ggml_backend_alloc_ctx_tensors()
    };
    struct ggml_context * ctx = ggml_init(params);

    // 3. Create tensors metadata (only their shapes and data types)
    struct ggml_tensor * tensor_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_A, rows_A);
    struct ggml_tensor * tensor_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_B, rows_B);

    // 4. Allocate a `ggml_backend_buffer` to store all tensors
    ggml_backend_buffer_t buffer = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // 5. Copy tensor data from main memory (RAM) to backend buffer
    ggml_backend_tensor_set(tensor_a, matrix_A, 0, ggml_nbytes(tensor_a));
    ggml_backend_tensor_set(tensor_b, matrix_B, 0, ggml_nbytes(tensor_b));

    // 6. Create a `ggml_cgraph` for mul_mat operation
    struct ggml_cgraph * gf = NULL;
    struct ggml_context * ctx_cgraph = NULL;
    {
        // create a temporary context to build the graph
        struct ggml_init_params params0 = {
            /*.mem_size =*/ ggml_tensor_overhead()*GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead(),
            /*.mem_buffer =*/ NULL,
            /*.no_alloc =*/ true, // the tensors will be allocated later by ggml_gallocr_alloc_graph()
        };
        ctx_cgraph = ggml_init(params0);
        gf = ggml_new_graph(ctx_cgraph);

        // result = a*b^T
        // Pay attention: ggml_mul_mat(A, B) ==> B will be transposed internally
        // the result is transposed
        struct ggml_tensor * result0 = ggml_mul_mat(ctx_cgraph, tensor_a, tensor_b);

        // Add "result" tensor and all of its dependencies to the cgraph
        ggml_build_forward_expand(gf, result0);
    }

    // 7. Create a `ggml_gallocr` for cgraph computation
    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(allocr, gf);

    // (we skip step 8. Optionally: schedule the cgraph using `ggml_backend_sched`)

    // 9. Run the computation
    int n_threads = 1; // Optional: number of threads to perform some operations with multi-threading
    if (ggml_backend_is_cpu(backend)) {
        ggml_backend_cpu_set_n_threads(backend, n_threads);
    }
    ggml_backend_graph_compute(backend, gf);

    // 10. Retrieve results (output tensors)
    // in this example, output tensor is always the last tensor in the graph
    struct ggml_tensor * result = gf->nodes[gf->n_nodes - 1];
    float * result_data = malloc(ggml_nbytes(result));
    // because the tensor data is stored in device buffer, we need to copy it back to RAM
    ggml_backend_tensor_get(result, result_data, 0, ggml_nbytes(result));
    printf("mul mat (%d x %d) (transposed result):\n[", (int) result->ne[0], (int) result->ne[1]);
    for (int j = 0; j < result->ne[1]/* rows */; j++) {
        if (j > 0) {
            printf("\n");
        }

        for (int i = 0; i < result->ne[0]/* cols */; i++) {
            printf(" %.2f", result_data[j * result->ne[0] + i]);
        }
    }
    printf(" ]\n");
    free(result_data);

    // 11. Free memory and exit
    ggml_free(ctx_cgraph);
    ggml_gallocr_free(allocr);
    ggml_free(ctx);
    ggml_backend_buffer_free(buffer);
    ggml_backend_free(backend);
    return 0;
}