Ubuntu 22.04

llama-cpp-python : Install 2024/02/16

 

Install [llama-cpp-python], the Python bindings for [llama.cpp], an interface to Meta's Llama (Large Language Model Meta AI) models.
The steps below run without a GPU.

[1] Install Python 3 beforehand, referring to here.

[2] Install the other required packages.
root@dlp:~#
apt -y install python3-pip python3-dev python3-venv gcc g++ make jq

[3] Log in as any regular user and prepare a Python virtual environment for installing [llama-cpp-python].
ubuntu@dlp:~$
python3 -m venv --system-site-packages ~/llama

ubuntu@dlp:~$
source ~/llama/bin/activate

(llama) ubuntu@dlp:~$
[4] Install [llama-cpp-python].
(llama) ubuntu@dlp:~$
pip3 install llama-cpp-python[server]

Collecting llama-cpp-python[server]
  Downloading llama_cpp_python-0.2.44.tar.gz (36.6 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done

.....
.....

Successfully installed annotated-types-0.6.0 anyio-4.2.0 diskcache-5.6.3 exceptiongroup-1.2.0 fastapi-0.109.2 h11-0.14.0 llama-cpp-python-0.2.44 numpy-1.26.4 pydantic-2.6.1 pydantic-core-2.16.2 pydantic-settings-2.1.0 python-dotenv-1.0.1 sniffio-1.3.0 sse-starlette-2.0.0 starlette-0.36.3 starlette-context-0.3.6 typing-extensions-4.9.0 uvicorn-0.27.1
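Besides the bundled server used in the steps below, the package can also be called directly from Python. The following is a minimal sketch of direct use; it assumes a GGUF model file is already on disk (the model shown here is the one downloaded in step [5] below).

from llama_cpp import Llama

# Load a local GGUF model (this file is downloaded in step [5] below)
llm = Llama(model_path="./llama-2-13b-chat.Q4_K_M.gguf", n_ctx=2048)

# Chat-style completion, equivalent to the server's /v1/chat/completions
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Who are you?"}]
)
print(result["choices"][0]["message"]["content"])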
[5] Download a GGUF-format model that [llama.cpp] can use, then start [llama-cpp-python].
Models can be downloaded from the sites below. This example uses [llama-2-13b-chat.Q4_K_M.gguf].

⇒ https://huggingface.co/TheBloke/Llama-2-7B-chat-GGUF/tree/main
⇒ https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/tree/main
⇒ https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGUF/tree/main
(llama) ubuntu@dlp:~$
wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_K_M.gguf

(llama) ubuntu@dlp:~$
python3 -m llama_cpp.server --model ./llama-2-13b-chat.Q4_K_M.gguf --host 0.0.0.0 --port 8000 &

llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ./llama-2-13b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 40
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 40
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_K:  241 tensors
llama_model_loader: - type q6_K:   41 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 7.33 GiB (4.83 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size =  7500.85 MiB
..........................warning: failed to mlock 58060800-byte buffer (after previously locking 2057080832 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
..........................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1600.00 MiB
llama_new_context_with_model: KV self size  = 1600.00 MiB, K (f16):  800.00 MiB, V (f16):  800.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    15.01 MiB
llama_new_context_with_model:        CPU compute buffer size =   200.00 MiB
llama_new_context_with_model: graph splits (measure): 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'llama.embedding_length': '5120', 'llama.feed_forward_length': '13824', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'llama.attention.head_count': '40', 'tokenizer.ggml.bos_token_id': '1', 'llama.block_count': '40', 'llama.attention.head_count_kv': '40', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '15'}
INFO:     Started server process [2933]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
[6] From any computer on the local network, you can browse the API documentation at [http://(server's hostname or IP address):8000/docs].
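You can also confirm from a script that the API is answering. The following is a minimal sketch using only the Python standard library; it assumes the server exposes the OpenAI-compatible [/v1/models] listing endpoint and that it runs on localhost:8000 as above.

import json
import urllib.request

# List the models served by the OpenAI-compatible endpoint
# (adjust host/port if the server runs elsewhere)
with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))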
[7] Submit a simple question to verify that the server works.
Response time and content vary with the question and the model in use; since inference runs on the CPU only, expect responses to take a while.
For reference, this example runs on a machine with 8 vCPUs and 16 GB of memory.
# Who are you?

(llama) ubuntu@dlp:~$
curl -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions \
-d '{"messages": [{"role": "user", "content": "Who are you?"}]}' | jq


llama_print_timings:        load time =    1926.79 ms
llama_print_timings:      sample time =      16.65 ms /    76 runs   (    0.22 ms per token,  4564.29 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   22648.86 ms /    76 runs   (  298.01 ms per token,     3.36 tokens per second)
llama_print_timings:       total time =   22809.91 ms /    77 tokens
INFO:     127.0.0.1:43678 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{
  "id": "chatcmpl-dccf551d-f0be-4449-835f-8bdc8c1a036d",
  "object": "chat.completion",
  "created": 1708061706,
  "model": "./llama-2-13b-chat.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "  Hello! My name is LLaMA, I'm an AI trained by a team of researcher at Meta AI. My primary function is to assist with tasks and answer questions to the best of my ability. I am capable of understanding and responding to human input in a conversational manner. Please let me know how I can be of assistance today!",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 75,
    "total_tokens": 90
  }
}
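The same request can be sent from Python instead of curl. Below is a minimal sketch using only the standard library; the endpoint and payload mirror the curl command above.

import json
import urllib.request

# Build the same chat completion request that curl sent above
payload = {"messages": [{"role": "user", "content": "Who are you?"}]}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    answer = json.load(resp)

# Print only the assistant's reply
print(answer["choices"][0]["message"]["content"])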

# Tell me about Hiroshima

(llama) ubuntu@dlp:~$
curl -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions \
-d '{"messages": [{"role": "user", "content": "Tell me about Hiroshima city, Japan."}]}' | jq | sed -e 's/\\n/\n/g'


llama_print_timings:        load time =    1926.79 ms
llama_print_timings:      sample time =     114.43 ms /   526 runs   (    0.22 ms per token,  4596.58 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =  156558.75 ms /   526 runs   (  297.64 ms per token,     3.36 tokens per second)
llama_print_timings:       total time =  157927.03 ms /   527 tokens
INFO:     127.0.0.1:34374 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{
  "id": "chatcmpl-662e47ec-5617-4809-b639-f111d3c04a55",
  "object": "chat.completion",
  "created": 1708062685,
  "model": "./llama-2-13b-chat.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "  Sure, I'd be happy to tell you about Hiroshima City, Japan!

Hiroshima is a city located in the Chugoku region of western Japan. It is best known for being the first city in the world to be targeted by an atomic bomb when it was dropped by the United States on August 6, 1945, during the final stages of World War II. The bombing killed an estimated 70,000 people instantly, and another 70,000 died from injuries and radiation sickness in the months and years that followed. Today, Hiroshima is a thriving city with a population of over 1 million people.

The Hiroshima Peace Memorial Park, which includes the Atomic Bomb Dome (the ruins of the prefectural headquarters building that survived the blast), the Children's Peace Monument, and the Memorial Museum, serves as a reminder of the devastating effects of nuclear weapons and the importance of promoting peace and disarmament. The park was dedicated to the memory of the victims of the atomic bombing and as a symbol of hope for peace and nuclear disarmament.

In addition to its historical significance, Hiroshima is also known for its traditional cuisine, which includes dishes such as okonomiyaki (a savory pancake made with batter, vegetables, and meat) and oysters, as well as its festivals and cultural events, such as the Hiroshima Festival, which takes place in August and features traditional music and dance performances, and the Hiroshima Animation Festival, which showcases the work of animators from around the world.

The city also has several notable landmarks and attractions, including the Hiroshima Castle, which was built in the 16th century and has been reconstructed several times after being destroyed by wars and natural disasters; the Miyajima Island, which is famous for its beautiful scenery and historic landmarks such as the famous Itsukushima Shrine, which appears to be floating on water during high tide; and the Hiroshima Museum of Art, which features a collection of modern and contemporary art from Japanese and international artists.

Overall, Hiroshima is a city with a rich history and culture, and it continues to play an important role in promoting peace and disarmament around the world.",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 525,
    "total_tokens": 548
  }
}
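For long answers like the one above, streaming avoids waiting for the full response before seeing any text. A minimal sketch, assuming the server returns OpenAI-style server-sent events ("data: {...}" lines) for requests with "stream": true:

import json
import urllib.request

# Same chat request as above, but streamed; the server is assumed to
# answer with OpenAI-style server-sent events
payload = {
    "messages": [{"role": "user", "content": "Tell me about Hiroshima city, Japan."}],
    "stream": True,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for raw in resp:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        # Each chunk carries an incremental "delta" with new text
        print(chunk["choices"][0].get("delta", {}).get("content", ""), end="", flush=True)
print()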
[8] Several Japanese-language models are also available.
As an example, try a GGUF-format conversion of the Japanese model published by ELYZA, Inc.

⇒ https://huggingface.co/mmnga/ELYZA-japanese-Llama-2-7b-fast-instruct-gguf
⇒ https://huggingface.co/mmnga/ELYZA-japanese-Llama-2-13b-fast-instruct-gguf
(llama) ubuntu@dlp:~$
curl -LO https://huggingface.co/mmnga/ELYZA-japanese-Llama-2-13b-fast-instruct-gguf/resolve/main/ELYZA-japanese-Llama-2-13b-fast-instruct-q5_0.gguf
(llama) ubuntu@dlp:~$
python3 -m llama_cpp.server --model ./ELYZA-japanese-Llama-2-13b-fast-instruct-q5_0.gguf --host 0.0.0.0 --port 8000 &
# Where will the next Olympics be held?

(llama) ubuntu@dlp:~$
curl -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions \
-d '{"messages": [{"role": "user", "content": "次回のオリンピックはどこで開催?"}]}' | jq


llama_print_timings:        load time =    3266.01 ms
llama_print_timings:      sample time =       5.81 ms /    21 runs   (    0.28 ms per token,  3614.46 tokens per second)
llama_print_timings: prompt eval time =    2643.76 ms /    12 tokens (  220.31 ms per token,     4.54 tokens per second)
llama_print_timings:        eval time =    7751.69 ms /    20 runs   (  387.58 ms per token,     2.58 tokens per second)
llama_print_timings:       total time =   10477.29 ms /    32 tokens
INFO:     127.0.0.1:33286 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{
  "id": "chatcmpl-c5472ea6-3ed2-438e-9b15-247a322468b3",
  "object": "chat.completion",
  "created": 1708067530,
  "model": "./ELYZA-japanese-Llama-2-13b-fast-instruct-q5_0.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": " 2024年夏季オリンピックの開催地は、フランス・パリに決定しました。",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "completion_tokens": 20,
    "total_tokens": 38
  }
}