llama.cpp : Install
2024/02/22

	Install [llama.cpp], an interface for running Meta's Llama (Large Language Model Meta AI) models. The example below uses a GPU.
[1]	Install CUDA, refer to here.
[2]	Install other required packages.

[root@dlp ~]#

dnf -y install cudnn9-cuda-12 python3-pip python3-devel python3-numpy gcc gcc-c++ cmake ccache jq

[3]	Build [llama.cpp].

[cent@dlp ~]$

git clone https://github.com/ggerganov/llama.cpp

Cloning into 'llama.cpp'...
remote: Enumerating objects: 18978, done.
remote: Counting objects: 100% (6489/6489), done.
remote: Compressing objects: 100% (540/540), done.
remote: Total 18978 (delta 6260), reused 5989 (delta 5947), pack-reused 12489
Receiving objects: 100% (18978/18978), 21.49 MiB | 19.02 MiB/s, done.
Resolving deltas: 100% (13356/13356), done.

[cent@dlp ~]$

cd llama.cpp

[cent@dlp llama.cpp]$

make LLAMA_CUBLAS=1

# * If you want to build a binary that runs only on the CPU, run only [make] without options

[4]

Download a GGML format model and convert it to GGUF format.
Models can be downloaded from the following sites. This example uses [llama-2-13b-chat.ggmlv3.q8_0.bin].

⇒ https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/tree/main
⇒ https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main
⇒ https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/tree/main

[cent@dlp llama.cpp]$

curl -L -o llama-2-13b-chat.ggmlv3.q8_0.bin "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q8_0.bin?download=true"

# convert to GGUF format

[cent@dlp llama.cpp]$

python3 ./convert-llama-ggml-to-gguf.py --input ./llama-2-13b-chat.ggmlv3.q8_0.bin --output ./llama-2-13b-chat.ggmlv3.q8_0.gguf


.....
.....
* Preparing to save GGUF file
gguf: This GGUF file is for Little Endian only
* Adding model parameters and KV items
* Adding 32000 vocab item(s)
* Adding 363 tensor(s)
    gguf: write header
    gguf: write metadata
    gguf: write tensors
* Successful completion. Output saved to: llama-2-13b-chat.ggmlv3.q8_0.gguf
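
Before starting the server, it can be worth confirming that the converted file really is a GGUF file. The following is a minimal sketch, not part of llama.cpp: the helper names [is_gguf] and [gguf_version] are hypothetical, and the version check assumes a little-endian host. It relies on GGUF files beginning with the ASCII bytes "GGUF" followed by a 32-bit version number.

```shell
# Hypothetical helpers (names are illustrative, not part of llama.cpp).

# Succeeds if the file starts with the 4-byte ASCII magic "GGUF".
is_gguf() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Prints the format version stored in bytes 4-7.
# Assumes a little-endian host, since od reads the value in host byte order.
gguf_version() {
    od -An -t u4 -j 4 -N 4 "$1" | tr -d ' \n'
}
```

For example, [is_gguf ./llama-2-13b-chat.ggmlv3.q8_0.gguf && gguf_version ./llama-2-13b-chat.ggmlv3.q8_0.gguf] is expected to print the format version of the converted file.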

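# [--n-gpu-layers] : number of layers to put on the GPU
# -- specify [-1] to use all if you do not know; note that the log below reports [offloaded 0/41 layers to GPU] with this build, so if no layers are offloaded, specify an explicit number that fits in your GPU memory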

[cent@dlp llama.cpp]$

./server --model ./llama-2-13b-chat.ggmlv3.q8_0.gguf --n-gpu-layers -1 --host 0.0.0.0 --port 8000 &

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
{"timestamp":1708584995,"level":"INFO","function":"main","line":2573,"message":"build info","build":2234,"commit":"973053d8"}
{"timestamp":1708584995,"level":"INFO","function":"main","line":2576,"message":"system info","n_threads":4,"n_threads_batch":-1,"total_threads":8,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "}

llama server listening at http://0.0.0.0:8000

{"timestamp":1708584995,"level":"INFO","function":"main","line":2731,"message":"HTTP server listening","port":"8000","hostname":"0.0.0.0"}
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ./llama-2-13b-chat.ggmlv3.q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = llama-2-13b-chat.ggmlv3.q8_0.bin
llama_model_loader: - kv   2:                        general.description str              = converted from legacy GGJTv3 MOSTLY_Q...
llama_model_loader: - kv   3:                          general.file_type u32              = 7
llama_model_loader: - kv   4:                       llama.context_length u32              = 2048
llama_model_loader: - kv   5:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   6:                          llama.block_count u32              = 40
llama_model_loader: - kv   7:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   8:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   9:                 llama.attention.head_count u32              = 40
llama_model_loader: - kv  10:              llama.attention.head_count_kv u32              = 40
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000005
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q8_0:  282 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 5.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 12.88 GiB (8.50 BPW)
llm_load_print_meta: general.name     = llama-2-13b-chat.ggmlv3.q8_0.bin
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/41 layers to GPU
llm_load_tensors:        CPU buffer size = 13189.86 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   400.00 MiB
llama_new_context_with_model: KV self size  =  400.00 MiB, K (f16):  200.00 MiB, V (f16):  200.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    12.01 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    80.00 MiB
llama_new_context_with_model: graph splits (measure): 1
Available slots:
 -> Slot 0 - max context: 512
{"timestamp":1708584999,"level":"INFO","function":"main","line":2752,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
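
The KV cache size reported in the log above can be reproduced by hand, which helps when estimating memory for a larger context. The sketch below assumes the cache holds one f16 key vector and one f16 value vector of [n_embd_k_gqa] elements per layer per context position; the helper name [kv_cache_mib] is hypothetical.

```shell
# Hypothetical helper: estimate the f16 KV cache size in MiB.
# Arguments: n_layer, n_ctx, n_embd_k_gqa (assumed equal to n_embd_v_gqa).
kv_cache_mib() {
    bytes_per_f16=2
    # One key vector and one value vector per layer per context position.
    echo $(( 2 * $1 * $2 * $3 * bytes_per_f16 / 1024 / 1024 ))
}

# Values taken from the log above: n_layer = 40, n_ctx = 512, n_embd_k_gqa = 5120
kv_cache_mib 40 512 5120    # prints 400
```

Under the same assumptions, raising the context to the training length of 2048 ([kv_cache_mib 40 2048 5120]) gives 1600 MiB.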

[5]	Post a question as follows and verify that the server responds normally. Response time and content vary with the question and the model used. For reference, this example runs on a machine with 8 vCPU + 16G memory + GeForce RTX 3060 (12G).

[cent@dlp llama.cpp]$

curl -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions \
-d '{"messages": [{"role": "user", "content": "What is the highest price of the Nikkei Stock Average on the Tokyo Stock Exchange?"}]}' | jq | sed -e 's/\\n/\n/g'


print_timings: prompt eval time =    3638.34 ms /    47 tokens (   77.41 ms per token,    12.92 tokens per second)
print_timings:        eval time =   64947.79 ms /   102 runs   (  636.74 ms per token,     1.57 tokens per second)
print_timings:       total time =   68586.13 ms
slot 0 released (149 tokens in cache)
{"timestamp":1708585483,"level":"INFO","function":"log_server_request","line":2510,"message":"request","remote_addr":"127.0.0.1","remote_port":59018,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The highest price of the Nikkei Stock Average on the Tokyo Stock Exchange was 38,915.47 on December 29, 1989. However, please note that this is a historical data point and may not reflect current market conditions or future performance. It's important to do your own research and consult with a financial advisor before making any investment decisions. Is there anything else I can help you with?",
        "role": "assistant"
      }
    }
  ],
  "created": 1708585483,
  "id": "chatcmpl-Lk7RjQkTHFIdrueDfIIW7g8P7CaE8jh8",
  "model": "unknown",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 102,
    "prompt_tokens": 47,
    "total_tokens": 149
  }
}
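
When only the answer text is needed, the request can be wrapped in a small shell function. This is a sketch rather than an official client: the function names [ask] and [extract_reply] are hypothetical, and the host, port and endpoint are the ones used in this example. It builds the JSON body with [jq] so that quotes in the question are escaped correctly, and uses [jq -r] to print the reply with real line breaks.

```shell
# Hypothetical helpers (ask, extract_reply); endpoint and port as used above.

# Reads a chat completion response on stdin and prints the reply text.
extract_reply() {
    jq -r '.choices[0].message.content'
}

# Sends one user message to the server and prints only the reply text.
ask() {
    # jq -n builds the JSON body and escapes the question safely.
    payload=$(jq -n --arg q "$1" '{messages: [{role: "user", content: $q}]}')
    curl -s -X POST -H 'Content-Type: application/json' \
        -d "$payload" localhost:8000/v1/chat/completions | extract_reply
}
```

With the server from step [4] running, a call such as [ask "What is the highest price of the Nikkei Stock Average on the Tokyo Stock Exchange?"] should print just the [content] field of the response.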
