CentOS Stream 9

llama-cpp-python : Install (GPU) 2024/02/19

 

Install [llama-cpp-python], the Python binding for [llama.cpp], which is an interface for Meta's Llama (Large Language Model Meta AI) models.
The example below uses a GPU.

[1] Install Python 3, refer to here.

[2] Install CUDA, refer to here.
[3] Install other required packages.
[root@dlp ~]#
dnf -y install cudnn9-cuda-12 python3-pip python3-devel gcc gcc-c++ cmake jq

[4] Login as a common user and prepare a Python virtual environment to install [llama-cpp-python].
[cent@dlp ~]$
python3 -m venv --system-site-packages ~/llama

[cent@dlp ~]$
source ~/llama/bin/activate

(llama) [cent@dlp ~]$
[5] Install [llama-cpp-python].
(llama) [cent@dlp ~]$
export LLAMA_CUBLAS=1 FORCE_CMAKE=1 CMAKE_ARGS="-DLLAMA_CUBLAS=on"

(llama) [cent@dlp ~]$
pip3 install "llama-cpp-python[server]"

Collecting llama-cpp-python[server]
  Downloading llama_cpp_python-0.2.44.tar.gz (36.6 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done

.....
.....

Successfully installed MarkupSafe-2.1.5 annotated-types-0.6.0 anyio-4.2.0 click-8.1.7 diskcache-5.6.3 exceptiongroup-1.2.0 fastapi-0.109.2 h11-0.14.0 jinja2-3.1.3 llama-cpp-python-0.2.44 numpy-1.26.4 pydantic-2.6.1 pydantic-core-2.16.2 pydantic-settings-2.2.0 python-dotenv-1.0.1 sniffio-1.3.0 sse-starlette-2.0.0 starlette-0.36.3 starlette-context-0.3.6 typing-extensions-4.9.0 uvicorn-0.27.1
[6]

Download a model in the GGUF format that [llama.cpp] can use, and start [llama-cpp-python].
Models can be downloaded from the following sites. In this example, we use [llama-2-13b-chat.Q4_K_M.gguf].

⇒ https://huggingface.co/TheBloke/Llama-2-7B-chat-GGUF/tree/main
⇒ https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/tree/main
⇒ https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGUF/tree/main
(llama) [cent@dlp ~]$
wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_K_M.gguf

# [--n_gpu_layers] : number of layers to offload to the GPU
# -- specify [-1] to offload all layers if you are unsure

(llama) [cent@dlp ~]$
python3 -m llama_cpp.server --model ./llama-2-13b-chat.Q4_K_M.gguf --n_gpu_layers -1 --host 0.0.0.0 --port 8000 &

(llama) [cent@dlp ~]$ ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ./llama-2-13b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 40
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 40
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_K:  241 tensors
llama_model_loader: - type q6_K:   41 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 7.33 GiB (4.83 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.28 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:        CPU buffer size =    87.89 MiB
llm_load_tensors:      CUDA0 buffer size =  7412.96 MiB
warning: failed to mlock 92905472-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1600.00 MiB
llama_new_context_with_model: KV self size  = 1600.00 MiB, K (f16):  800.00 MiB, V (f16):  800.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    15.01 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   204.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    10.00 MiB
llama_new_context_with_model: graph splits (measure): 3
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'llama.embedding_length': '5120', 'llama.feed_forward_length': '13824', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'llama.attention.head_count': '40', 'tokenizer.ggml.bos_token_id': '1', 'llama.block_count': '40', 'llama.attention.head_count_kv': '40', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '15'}
INFO:     Started server process [1717]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
[7] You can browse the API documentation by accessing [http://(server hostname or IP address):8000/docs] from any computer on your local network.
[8] Post some questions as follows and verify that it works normally.
The response time and response contents will vary depending on the question and the model used.
By the way, this example is running on a machine with 8 vCPUs + 16 GB memory + a GeForce RTX 3060 (12 GB).
When asked the same question on the same machine, processing is more than 5 times faster than the same setup without a GPU.
(llama) [cent@dlp ~]$
curl -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions \
-d '{"messages": [{"role": "user", "content": "Introduce yourself."}]}' | jq


llama_print_timings:        load time =     256.17 ms
llama_print_timings:      sample time =      58.24 ms /    94 runs   (    0.62 ms per token,  1613.90 tokens per second)
llama_print_timings: prompt eval time =     256.06 ms /    16 tokens (   16.00 ms per token,    62.48 tokens per second)
llama_print_timings:        eval time =    2593.86 ms /    93 runs   (   27.89 ms per token,    35.85 tokens per second)
llama_print_timings:       total time =    3170.25 ms /   109 tokens
INFO:     127.0.0.1:51506 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{
  "id": "chatcmpl-6ccaa09e-8d47-41d1-900b-3aaa0f28dd0d",
  "object": "chat.completion",
  "created": 1708314274,
  "model": "./llama-2-13b-chat.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "  Hello! My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. My capabilities are centered around generating human-like text based on the input I receive. My primary function is to understand and process natural language, and generate appropriate responses that are contextually relevant and coherent. I can answer questions, provide information, and engage in conversation. So feel free to ask me anything!",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 93,
    "total_tokens": 109
  }
}
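The same chat request can also be sent from Python using only the standard library, since the server exposes an OpenAI-compatible API. A minimal sketch, assuming the server above is listening on localhost:8000; the helper names [build_chat_request] and [ask] are ours, not part of llama-cpp-python:

```python
import json
import urllib.request


def build_chat_request(content: str, role: str = "user") -> dict:
    """Build the JSON body for the OpenAI-compatible /v1/chat/completions endpoint."""
    return {"messages": [{"role": role, "content": content}]}


def ask(content: str, url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST a chat completion request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(content)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return data["choices"][0]["message"]["content"]


# With the server running:
#   print(ask("Introduce yourself."))
```

This mirrors the [curl] call above: the request body only needs a [messages] list, and the reply text is at [choices[0].message.content] in the response JSON.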

(llama) [cent@dlp ~]$
curl -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions \
-d '{"messages": [{"role": "user", "content": "Tell me about Hiroshima city, Japan."}]}' | jq | sed -e 's/\\n/\n/g'


llama_print_timings:        load time =     256.17 ms
llama_print_timings:      sample time =     697.91 ms /  1194 runs   (    0.58 ms per token,  1710.81 tokens per second)
llama_print_timings: prompt eval time =     234.63 ms /    17 tokens (   13.80 ms per token,    72.45 tokens per second)
llama_print_timings:        eval time =   35294.38 ms /  1193 runs   (   29.58 ms per token,    33.80 tokens per second)
llama_print_timings:       total time =   40329.88 ms /  1210 tokens
INFO:     127.0.0.1:48448 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{
  "id": "chatcmpl-a8b78acd-b379-40be-8cf6-c4b53b6d1bb8",
  "object": "chat.completion",
  "created": 1708314671,
  "model": "./llama-2-13b-chat.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "  Sure, I'd be happy to help! Here are some key points about Hiroshima City, Japan:

1. Location: Hiroshima is a city located in the Chugoku region of western Japan. It is situated on the banks of the Ota River and is known for its beautiful scenery and rich cultural heritage.
2. History: Hiroshima was founded in 1589 by Mori Terumoto, a powerful daimyo (feudal lord) who ruled the region during the Sengoku period (1467-1603). The city played an important role in Japanese history, particularly during World War II when it was the first city in the world to be destroyed by an atomic bomb on August 6, 1945. The bombing killed an estimated 70,000 to 80,000 people immediately, and many more died in the following months and years from radiation sickness. Today, Hiroshima is known for its efforts towards peace and disarmament.
3. Attractions: Some of the most popular attractions in Hiroshima include:
* Hiroshima Peace Memorial Park: This park was built on the site where the atomic bomb was dropped. It features several memorials, museums, and monuments dedicated to the victims of the bombing and the promotion of peace. The park also includes the Hiroshima Peace Memorial Museum, which features exhibits on the history of the bombing and its aftermath.
* Atomic Bomb Dome (Genbaku Dom): This building was one of the few structures that survived the atomic bombing and is now a UNESCO World Heritage Site. It has been preserved in its ruined state as a reminder of the devastating effects of nuclear weapons.
* Miyajima Island: Located just off the coast of Hiroshima, this island is famous for its beautiful scenery and historic landmarks. It is home to the famous Itsukushima Shrine, which is listed as a UNESCO World Heritage Site.
* Hiroshima Castle: This castle was built in the 16th century and is one of the few castles in Japan that survived the atomic bombing. It has been restored and now features a museum that showcases the history of the castle and the region.
4. Food: Hiroshima is known for its local cuisine, which includes dishes such as okonomiyaki (a savory pancake made with batter and various ingredients), kakiage (a savory fritter made with vegetables and batter), and oyster dishes (Hiroshima is famous for its oysters). Other local specialties include Hiroshima-style ramen (a type of noodle soup), Hiroshima-style yakisoba (a type of stir-fried noodles), and Hiroshima-style udon (a type of thick wheat flour noodles).
5. Events: Some of the major events held in Hiroshima include the Hiroshima Peace Memorial Ceremony (held on August 6 every year), the Hiroshima Festivals (held in August and September), and the Miyajima Island Fireworks Festival (held in August every year).
6. Education: Hiroshima has several universities and research institutions, including Hiroshima University, which is one of the largest universities in Japan and has a strong focus on research and education in fields such as medicine, engineering, and humanities. Other notable educational institutions in Hiroshima include the Hiroshima Prefectural University and the Hiroshima Jogakuen College.
7. Economy: The economy of Hiroshima is driven by a variety of industries, including manufacturing, agriculture, and tourism. Some of the major industries in Hiroshima include automotive manufacturing (Toyota and Mazda have factories in the region), electronics manufacturing (Hitachi and Panasonic have factories in the region), and agriculture (the region is known for its high-quality produce, including rice, soybeans, and vegetables).
8. Climate: Hiroshima has a humid subtropical climate with hot summers and cool winters. The average temperature in August (the hottest month) is around 28°C (82°F), while the average temperature in January (the coldest month) is around 2°C (36°F).
9. Transportation: Hiroshima is served by two major airports, Hiroshima Airport and Miyajima Airport, as well as several major train stations, including Hiroshima Station and Miyajimaguchi Station. The city is also connected to other major cities in Japan by an extensive network of highways and trains.
10. Culture: Hiroshima has a rich cultural heritage, with many festivals and events held throughout the year. Some of the most notable cultural events in Hiroshima include the Hiroshima Festival (held in August every year), the Miyajima Island Fireworks Festival (held in August every year), and the Hiroshima Jazz Festival (held in November every year). The city is also home to several museums and cultural institutions, including the Hiroshima Peace Memorial Museum and the Hiroshima Museum of Art.",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 1193,
    "total_tokens": 1216
  }
}
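As a sanity check on the [llama_print_timings] figures above, throughput is simply the token count divided by elapsed time. A small sketch recomputing the eval phase of the second request (1193 tokens in 35294.38 ms):

```python
def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Convert a token count and elapsed time in milliseconds to tokens per second."""
    return n_tokens / (elapsed_ms / 1000.0)


def ms_per_token(n_tokens: int, elapsed_ms: float) -> float:
    """Average per-token latency in milliseconds."""
    return elapsed_ms / n_tokens


# Eval phase of the second request above: 1193 tokens in 35294.38 ms
rate = tokens_per_second(1193, 35294.38)   # ~33.80 tokens per second
latency = ms_per_token(1193, 35294.38)     # ~29.58 ms per token
```

These match the values the server prints (33.80 tokens per second, 29.58 ms per token), and the same arithmetic applies to the prompt eval and sample phases.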