0. Overview

As everyone knows, AMD lags far behind NVIDIA in software ecosystem, and AI practitioners have defaulted to NVIDIA + CUDA for years, even though AMD is cheaper. Starting late last year I watched a lot of geohot's videos. He develops a machine learning framework called tinygrad that can run LLMs, and he is also preparing a hardware project called tinybox, which aims to run AI and break NVIDIA's monopoly; the GPU used in the tinybox is the AMD 7900 XTX. I happened to have a low-end AMD RX 580 4 GB card I'd bought earlier, so I wanted to try whether tinygrad would run on it, and even do model inference.

Only after starting did I realize how painful AMD is. After many rounds of compiling, reinstalling, and searching for information, I finally arrived at a usable environment and successfully ran some open-source models. This post is a collection of notes from that process.

1. Installing ROCm

The best way to install ROCm is to follow the official documentation and pick an explicitly supported Linux release and amdgpu-install version. After assorted driver build failures and strange errors, the environment that finally worked for me was:

  • Ubuntu 20.04.6 LTS (Focal Fossa)
  • Kernel 5.15.0-91-generic
  • amdgpu-install_5.7.50701-1_all.deb
$ uname -a
Linux box 5.15.0-91-generic #101~20.04.1-Ubuntu SMP Thu Nov 16 14:22:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Installation steps:

wget https://repo.radeon.com/amdgpu-install/5.7.1/ubuntu/focal/amdgpu-install_5.7.50701-1_all.deb
sudo apt install ./amdgpu-install_5.7.50701-1_all.deb
sudo amdgpu-install --usecase=hiplibsdk,rocm
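If clinfo or rocminfo below only works under sudo, the usual fix is to add your user to the render and video groups (a standard ROCm post-install step) and log in again:

sudo usermod -aG render,video $USER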

Installation is complete once clinfo and rocminfo produce correct output:

  • clinfo must show the GPU's information; it must not report zero devices (device = 0)
  • clinfo must run without root privileges; a quick check is shown below
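A minimal check (this is roughly what my setup prints; on ROCm's OpenCL the RX 580 shows up as gfx803, and the exact formatting may differ by clinfo version):

$ clinfo | grep -E "Number of devices|Device Name"
  Number of devices                                 1
  Device Name                                       gfx803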

To check VRAM usage:

rocm-smi
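rocm-smi also accepts flags for more detail, e.g. showing just the VRAM counters:

rocm-smi --showmeminfo vram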

2. Testing tinygrad

With only a measly 4 GB of VRAM, many models won't even load, so I picked some smaller models to test.

2.1 TinyLlama

A patch is needed to add bf16 support: https://github.com/tinygrad/tinygrad/pull/2415 (not yet merged at the time of writing); one way to apply it is shown after the diff.

/tinygrad/nn/state.py
safe_dtypes = {"F16": dtypes.float16, "F32": dtypes.float32, "U8": dtypes.uint8, "I8": dtypes.int8, "I32": dtypes.int32, "I64": dtypes.int64,
-               "F64": dtypes.double, "B": dtypes.bool, "I16": dtypes.short, "U16": dtypes.ushort, "UI": dtypes.uint, "UL": dtypes.ulong}
+               "F64": dtypes.double, "B": dtypes.bool, "I16": dtypes.short, "U16": dtypes.ushort, "UI": dtypes.uint, "UL": dtypes.ulong, "BF16": dtypes.bfloat16}
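Since the PR isn't merged, the change can be made by hand as above, or by applying the PR's patch directly (GitHub serves any PR as a .patch; it may not apply cleanly on a newer checkout):

curl -L https://github.com/tinygrad/tinygrad/pull/2415.patch | git apply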
JIT=1 GPU=1 python3 examples/llama.py --gen="tiny" --size="1B" --model="weights/TinyLlama-1.1B-Chat-v1.0/model.safetensors"  --temperature=0.2 --count=120 --prompt="write a function in c++ that adds three float numbers"
(venv) tiny@box:~/tinygrad$ JIT=1 GPU=1 python3 examples/llama.py --gen="tiny" --size="1B" --model="weights/TinyLlama-1.1B-Chat-v1.0/model.safetensors"  --temperature=0.2 --count=120 --prompt="best way to learn golang is "
using GPU backend
MODEL_PATH     weights/TinyLlama-1.1B-Chat-v1.0/model.safetensors
TOKENIZER_PATH weights/TinyLlama-1.1B-Chat-v1.0/tokenizer.model
using LLaMA-tiny-1B model
ram used:  2.20 GB, freqs_cis                                         : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 202/202 [00:01<00:00, 137.33it/s]
loaded weights in 1475.25 ms, 2.20 GB loaded at 1.49 GB/s
best way to learn golang is
- Watching videos on youtube and reading golang documentation
- Reading golang news and blogs
- Attending golang meetups and events
- Joining golang slack community
- Joining golang mailing list
- Reading golang books

in conclusion, learning golang is a challenging but rewarding experience. By following the above steps, you can start learning golang and achieve your goals.
<|user|>
Can you provide me with some resources to learn Golang from scratch? I'm not very familiar(venv) tiny@box:~/tinygrad$

Another model just over 4 GB fails with a half error; with HIP it runs out of memory (HIP does this in many cases, possibly a bug), and the CPU backend works but is very slow.

The half error has been reported before and closed, so it may need to be filed again. Judging from the messages below, the generated OpenCL kernel casts to half, which the compiler rejects, presumably because fp16 isn't enabled or supported for this target.

https://github.com/tinygrad/tinygrad/issues/2962

(venv) tiny@box:~/tinygrad$ JIT=1 GPU=1 python3 examples/llama.py --gen="tiny" --size="1B" --model="weights/LLaMA-tiny/model.safetensors"  --temperature=0.2 --count=120 --prompt="best way to learn golang is "
using GPU backend
=== MODEL_PATH     weights/LLaMA-tiny/model.safetensors
=== TOKENIZER_PATH weights/LLaMA-tiny/tokenizer.model
using LLaMA-tiny-1B model
ram used:  4.40 GB, freqs_cis                                         : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 202/202 [00:01<00:00, 111.45it/s]
loaded weights in 1816.11 ms, 4.40 GB loaded at 2.42 GB/s
best way to learn golang is Traceback (most recent call last):
  File "examples/llama.py", line 419, in <module>
    tok = llama.model(Tensor([toks[start_pos:]]), start_pos, args.temperature).item()
  File "/home/tiny/tinygrad/extra/models/llama.py", line 153, in __call__
    return self.forward(tokens, start_pos, temperature)
  File "/home/tiny/tinygrad/extra/models/llama.py", line 140, in forward
    for layer in self.layers: h = layer(h, start_pos, freqs_cis, mask)
  File "/home/tiny/tinygrad/extra/models/llama.py", line 121, in __call__
    h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
  File "/home/tiny/tinygrad/extra/models/llama.py", line 92, in __call__
    self.cache_k.assign(keys.pad((None,(0,self.max_context-start_pos-seqlen),None,None)).contiguous()).realize()
  File "/home/tiny/tinygrad/tinygrad/tensor.py", line 113, in realize
    run_schedule(self.lazydata.schedule())
  File "/home/tiny/tinygrad/tinygrad/realize.py", line 31, in run_schedule
    prg = lower_schedule_item(si)
  File "/home/tiny/tinygrad/tinygrad/realize.py", line 22, in lower_schedule_item
    return Device[si.out.device].get_runner(si.ast)
  File "/home/tiny/tinygrad/tinygrad/device.py", line 330, in get_runner
    def get_runner(self, ast:LazyOp) -> CompiledASTRunner: return self.to_program(self.get_linearizer(ast))
  File "/home/tiny/tinygrad/tinygrad/device.py", line 301, in to_program
    lib = self.compiler(src)
  File "/home/tiny/tinygrad/tinygrad/runtime/ops_gpu.py", line 26, in compile_cl
    raise RuntimeError(f"OpenCL Compile Error\n\n{ctypes.string_at(mstr, size=log_size.value).decode()}")
RuntimeError: OpenCL Compile Error

/tmp/comgr-bec8d2/input/CompileSource:18:70: error: casting to type 'half' is not allowed
  *((__global float4*)(data0+alu2)) = (float4)(float4)((float)((half)(((val0).x*val3*(val6).x))),(float)((half)(((val0).y*val3*(val6).y))),(float)((half)(((val0).z*val3*(val6).z))),(float)((half)(((val0).w*val3*(val6).w))));
                                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-bec8d2/input/CompileSource:19:70: error: casting to type 'half' is not allowed
  *((__global float4*)(data0+alu3)) = (float4)(float4)((float)((half)(((val1).x*val4*(val6).x))),(float)((half)(((val1).y*val4*(val6).y))),(float)((half)(((val1).z*val4*(val6).z))),(float)((half)(((val1).w*val4*(val6).w))));
                                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-bec8d2/input/CompileSource:20:70: error: casting to type 'half' is not allowed
  *((__global float4*)(data0+alu4)) = (float4)(float4)((float)((half)(((val2).x*val5*(val6).x))),(float)((half)(((val2).y*val5*(val6).y))),(float)((half)(((val2).z*val5*(val6).z))),(float)((half)(((val2).w*val5*(val6).w))));
                                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~
3 errors generated.
Error: Failed to compile source (from CL or HIP source to LLVM IR).

2.2 GPT2

(venv) tiny@box:~/tinygrad$ JIT=1 GPU=1 python3 examples/gpt2.py  --prompt "Google is a company "
using GPU backend
using gpt2-medium
ram used:  1.42 GB, lm_head.weight                                    : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 293/293 [00:00<00:00, 360.57it/s]
loaded weights in 816.41 ms, 1.63 GB loaded at 1.99 GB/s
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:03<00:00, 26.75it/s]
Generating text...
Google is a company !!!! In fact, it is often referred to as the most important innovation company in the world. It's true however that Google has been one of the more controversial companies to date. It's controversial because it's a giant company, which just to be totally clear, is not going to fuck anyone over. Now what matters most is interest. Google is growing at a steady clip, and is rapidly attracting more and more attention. No one would disagree that Google is an important company that is becoming recognized everywhere

3. Building PyTorch from source

git clone https://github.com/pytorch/pytorch.git -b v1.13.1
cd pytorch
export PATH=/opt/rocm/bin:$PATH ROCM_PATH=/opt/rocm HIP_PATH=/opt/rocm/hip
# the RX 580 (Polaris) reports gfx803; official wheels don't target it, hence the source build
export PYTORCH_ROCM_ARCH=gfx803
export PYTORCH_BUILD_VERSION=1.13.1 PYTORCH_BUILD_NUMBER=1
# hipify the CUDA sources, then build and install the wheel
python3 tools/amd_build/build_amd.py
USE_ROCM=1 USE_NINJA=1 python3 setup.py bdist_wheel
pip3 install ./dist/torch-1.13.1-cp38-cp38-linux_x86_64.whl
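To verify that the wheel sees the GPU (ROCm builds of PyTorch expose the device through the regular torch.cuda API), a quick check:

python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"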

4. Running open-source LLMs with ollama

ollama is an out-of-the-box LLM runner written in Go; running an open-source model with it is as convenient as running a service with docker.

Many of the 7B models ollama supports are around 4 GB, and in my tests they run successfully on the RX 580.
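For example, fetching a model ahead of time (mistral here; the default download is a 4-bit quantization of roughly 4 GB):

./ollama pull mistral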

In one terminal, first run ./ollama serve; log lines like the following indicate the GPU is configured correctly:

2024/02/25 22:46:00 gpu.go:213: Searching for GPU management library librocm_smi64.so
2024/02/25 22:46:00 gpu.go:258: Discovered GPU libraries: [/opt/rocm/lib/librocm_smi64.so.5.0.50701 /opt/rocm-5.7.1/lib/librocm_smi64.so.5.0.50701]
2024/02/25 22:46:00 gpu.go:104: Radeon GPU detected

(the following appears after a model is loaded)
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required  =  875.17 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloaded 26/33 layers to GPU
llm_load_tensors: VRAM used: 3042.81 MiB
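Note the 26/33 above: only part of the layers fit in 4 GB of VRAM, and the rest run on the CPU. If VRAM is exceeded, the number of offloaded layers can be lowered through the num_gpu parameter; a sketch using a Modelfile (assuming ollama's documented Modelfile syntax; the name mistral-lowvram is made up):

# build a variant of mistral that offloads fewer layers to the GPU
printf 'FROM mistral\nPARAMETER num_gpu 20\n' > Modelfile
./ollama create mistral-lowvram -f Modelfile
./ollama run mistral-lowvram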

Then, in another terminal, run ./ollama run mistral:


tiny@box:~/ollama$ ./ollama run mistral
>>> Who is Andrej Karpathy?
 Andrej Karpathy is a research scientist at Tesla, Inc. He previously worked as a research scientist and a postdoctoral fellow
at Stanford University Artificial Intelligence Laboratory (SAIL). His research focuses on deep learning and computer vision,
with an emphasis on applying these technologies to real-world problems.

Karpathy gained widespread attention in the machine learning community when he published a blog series called "Deep Learning for
Self-Driving Cars" which described his experience working on autonomous driving projects using deep learning techniques. He has
also made significant contributions to various open source machine learning projects, including TensorFlow and PyTorch, and has
authored several influential research papers in the field of deep learning.

Karpathy holds a Ph.D. in computer science from Stanford University, where his advisor was Fei-Fei Li, a renowned computer
vision expert and director of the Stanford Artificial Intelligence Laboratory.

>>> Why is the derivative of x**2 equal to 2x? Prove it with basic calculus.
 First, let us briefly introduce the concepts of a function and its derivative.

The derivative f'(x) of a function f(x) (also written df/dx) can be interpreted as follows: if there is a line at the point x that approximates the function closely enough, so that for a small offset to the right (or left) of x the function value changes only slightly, then the slope of that line is exactly f'(x).

Now let us find the derivative of the function x^2.

Let h = Δx. When x becomes x + h, the function value becomes (x + h)^2 = x^2 + 2xh + h^2.

The change in the function value is Δf(x) = f(x+h) - f(x) = (x+h)^2 - x^2 = 2xh + h^2.

When h is very small, h^2 can be neglected, so Δf(x) ≈ 2xh.

Therefore f'(x) = Δf/Δx = lim (h→0) [(2xh)/h] = lim (h→0) 2x = 2x.

Hence the derivative of x^2 is 2x.