diff --git a/README.md b/README.md
index 93ccbcbb..466f315a 100644
--- a/README.md
+++ b/README.md
@@ -197,6 +197,18 @@ ollama show llama3.1
 ollama list
 ```
 
+### List which models are currently loaded
+
+```
+ollama ps
+```
+
+### Stop a model which is currently running
+
+```
+ollama stop llama3.1
+```
+
 ### Start Ollama
 
 `ollama serve` is used when you want to start ollama without running the desktop application.
diff --git a/docs/api.md b/docs/api.md
index 1ae60dc7..95e79e00 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -407,6 +407,33 @@ A single JSON object is returned:
 }
 ```
 
+#### Unload a model
+
+If an empty prompt is provided and the `keep_alive` parameter is set to `0`, a model will be unloaded from memory.
+
+##### Request
+
+```shell
+curl http://localhost:11434/api/generate -d '{
+  "model": "llama3.1",
+  "keep_alive": 0
+}'
+```
+
+##### Response
+
+A single JSON object is returned:
+
+```json
+{
+  "model": "llama3.1",
+  "created_at": "2024-09-12T03:54:03.516566Z",
+  "response": "",
+  "done": true,
+  "done_reason": "unload"
+}
+```
+
 ## Generate a chat completion
 
 ```shell
@@ -736,6 +763,64 @@ curl http://localhost:11434/api/chat -d '{
 }
 ```
 
+#### Load a model
+
+If the messages array is empty, the model will be loaded into memory.
+
+##### Request
+
+```
+curl http://localhost:11434/api/chat -d '{
+  "model": "llama3.1",
+  "messages": []
+}'
+```
+
+##### Response
+```json
+{
+  "model": "llama3.1",
+  "created_at":"2024-09-12T21:17:29.110811Z",
+  "message": {
+    "role": "assistant",
+    "content": ""
+  },
+  "done_reason": "load",
+  "done": true
+}
+```
+
+#### Unload a model
+
+If the messages array is empty and the `keep_alive` parameter is set to `0`, a model will be unloaded from memory.
+
+##### Request
+
+```
+curl http://localhost:11434/api/chat -d '{
+  "model": "llama3.1",
+  "messages": [],
+  "keep_alive": 0
+}'
+```
+
+##### Response
+
+A single JSON object is returned:
+
+```json
+{
+  "model": "llama3.1",
+  "created_at":"2024-09-12T21:33:17.547535Z",
+  "message": {
+    "role": "assistant",
+    "content": ""
+  },
+  "done_reason": "unload",
+  "done": true
+}
+```
+
 ## Create a Model
 
 ```shell
diff --git a/docs/faq.md b/docs/faq.md
index 6267ad2b..b2b1ca30 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -237,9 +237,13 @@ ollama run llama3.1 ""
 
 ## How do I keep a model loaded in memory or make it unload immediately?
 
-By default models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you are making numerous requests to the LLM. You may, however, want to free up the memory before the 5 minutes have elapsed or keep the model loaded indefinitely. Use the `keep_alive` parameter with either the `/api/generate` and `/api/chat` API endpoints to control how long the model is left in memory.
+By default models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you're making numerous requests to the LLM. If you want to immediately unload a model from memory, use the `ollama stop` command:
 
-The `keep_alive` parameter can be set to:
+```shell
+ollama stop llama3.1
+```
+
+If you're using the API, use the `keep_alive` parameter with the `/api/generate` and `/api/chat` endpoints to set the amount of time that a model stays in memory. The `keep_alive` parameter can be set to:
 * a duration string (such as "10m" or "24h")
 * a number in seconds (such as 3600)
 * any negative number which will keep the model loaded in memory (e.g. -1 or "-1m")
@@ -255,9 +259,9 @@ To unload the model and free up memory use:
 curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "keep_alive": 0}'
 ```
 
-Alternatively, you can change the amount of time all models are loaded into memory by setting the `OLLAMA_KEEP_ALIVE` environment variable when starting the Ollama server. The `OLLAMA_KEEP_ALIVE` variable uses the same parameter types as the `keep_alive` parameter types mentioned above. Refer to section explaining [how to configure the Ollama server](#how-do-i-configure-ollama-server) to correctly set the environment variable.
+Alternatively, you can change the amount of time all models are loaded into memory by setting the `OLLAMA_KEEP_ALIVE` environment variable when starting the Ollama server. The `OLLAMA_KEEP_ALIVE` variable uses the same parameter types as the `keep_alive` parameter types mentioned above. Refer to the section explaining [how to configure the Ollama server](#how-do-i-configure-ollama-server) to correctly set the environment variable.
 
-If you wish to override the `OLLAMA_KEEP_ALIVE` setting, use the `keep_alive` API parameter with the `/api/generate` or `/api/chat` API endpoints.
+The `keep_alive` API parameter with the `/api/generate` and `/api/chat` API endpoints will override the `OLLAMA_KEEP_ALIVE` setting.
 
 ## How do I manage the maximum number of requests the Ollama server can queue?
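The faq.md text above points readers at `OLLAMA_KEEP_ALIVE` for changing the default keep-alive of all models, but does not show an invocation. A minimal sketch, assuming the server is started directly from a shell rather than through the desktop app or a service manager (in which case the variable belongs in the service's environment, as covered by the linked FAQ entry on configuring the server):

```shell
# Keep loaded models in memory for 24 hours instead of the default 5 minutes.
# OLLAMA_KEEP_ALIVE accepts the same value types as the keep_alive request parameter.
OLLAMA_KEEP_ALIVE=24h ollama serve
```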
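The same FAQ text notes that a request-level `keep_alive` overrides `OLLAMA_KEEP_ALIVE`. As a hedged example of that interaction: with the server started as above, the request below would keep `llama3.1` in memory for only 10 minutes after the response completes.

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "keep_alive": "10m"
}'
```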