Commit graph

514 commits

Author SHA1 Message Date
Michael Yang f5e8b207fb s/DisplayLongest/String/ 2024-05-06 15:24:01 -07:00
Michael Yang d245460362 only quantize language models 2024-05-06 15:24:01 -07:00
Michael Yang 4d0d0fa383 no iterator 2024-05-06 15:24:01 -07:00
Michael Yang 7ffe45734d rebase 2024-05-06 15:24:01 -07:00
Michael Yang 01811c176a comments 2024-05-06 15:24:01 -07:00
Michael Yang a7248f6ea8 update tests 2024-05-06 15:24:01 -07:00
Michael Yang 9685c34509 quantize any fp16/fp32 model
- FROM /path/to/{safetensors,pytorch}
- FROM /path/to/fp{16,32}.bin
- FROM model:fp{16,32}
2024-05-06 15:24:01 -07:00
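The three FROM forms above map to three different source types. A minimal sketch of how such a dispatch could look (the function and labels here are illustrative, not ollama's actual implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// classifyFrom guesses which of the three supported FROM forms a value uses.
// Heuristics only: a real implementation would stat the path and sniff contents.
func classifyFrom(from string) string {
	switch {
	case strings.HasSuffix(from, ".bin"):
		return "raw fp16/fp32 .bin file"
	case strings.HasPrefix(from, "/") || strings.HasPrefix(from, "./"):
		return "safetensors or pytorch directory"
	default:
		return "existing model reference like model:fp16"
	}
}

func main() {
	for _, from := range []string{"/path/to/safetensors", "/path/to/fp16.bin", "model:fp16"} {
		fmt.Println(from, "->", classifyFrom(from))
	}
}
```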
Daniel Hiltgen 0963c65027
Merge pull request #4208 from dhiltgen/fix_sched_test
Fix stale test logic
2024-05-06 14:23:12 -07:00
Jeffrey Morgan c9f98622b1
Skip scheduling cancelled requests, always reload unloaded runners (#4189) 2024-05-06 14:22:24 -07:00
Daniel Hiltgen 0a954e5066 Fix stale test logic
The model processing was recently changed to be deferred but
this test scenario hadn't been adjusted for that change in behavior.
2024-05-06 14:15:37 -07:00
Jeffrey Morgan dfa2f32ca0
unload in critical section (#4187) 2024-05-05 17:18:27 -07:00
Daniel Hiltgen f56aa20014 Centralize server config handling
This moves all the env var reading into one central module
and logs the loaded config once at startup, which should
help in troubleshooting user server logs.
2024-05-05 16:49:50 -07:00
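A minimal sketch of the centralize-and-log-once pattern this describes, assuming illustrative names rather than ollama's actual envconfig API:

```go
package main

import (
	"log/slog"
	"os"
)

// Config gathers every environment-derived setting in one struct so the rest
// of the server never calls os.Getenv directly.
type Config struct {
	Debug       bool
	NumParallel string
}

// LoadConfig reads all env vars in one place.
func LoadConfig() Config {
	return Config{
		Debug:       os.Getenv("OLLAMA_DEBUG") != "",
		NumParallel: os.Getenv("OLLAMA_NUM_PARALLEL"),
	}
}

func main() {
	cfg := LoadConfig()
	// Logging the loaded config once at startup makes user server logs
	// self-describing when troubleshooting.
	slog.Info("server config", "debug", cfg.Debug, "num_parallel", cfg.NumParallel)
}
```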
Jeffrey Morgan 942c979232
allocate a large enough kv cache for all parallel requests (#4162) 2024-05-05 15:59:32 -07:00
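The sizing rule is simple arithmetic: each of the num_parallel slots needs its own num_ctx tokens of kv cache, so the runner must allocate their product. A worked sketch with illustrative numbers:

```go
package main

import "fmt"

func main() {
	numCtx := 2048   // context window per request
	numParallel := 4 // concurrent requests served by one runner
	// Every parallel sequence lives in the cache at once, so capacity
	// must cover all of them, not just a single request.
	fmt.Println("kv cache token capacity:", numCtx*numParallel) // 8192
}
```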
Patrick Devine 2a21363bb7
validate the format of the digest when getting the model path (#4175) 2024-05-05 11:46:12 -07:00
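A sketch of the kind of check this implies, assuming digests shaped like ollama's blob file names ("sha256-" plus 64 hex characters); the server's exact validation may differ:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// digestPattern encodes the assumed format: algorithm prefix plus 64 hex chars.
var digestPattern = regexp.MustCompile(`^sha256-[0-9a-f]{64}$`)

func validDigest(d string) bool { return digestPattern.MatchString(d) }

func main() {
	fmt.Println(validDigest("sha256-ab12"))                       // false: too short
	fmt.Println(validDigest("sha256-" + strings.Repeat("a", 64))) // true
}
```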
Daniel Hiltgen 20f6c06569 Make maximum pending requests configurable
This also bumps the default up from 10 to 50 queued requests.
2024-05-04 21:00:52 -07:00
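A sketch of reading that limit from the environment with the new default; OLLAMA_MAX_QUEUE is an assumed variable name here, so check the server's config module for the real knob:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// maxQueue returns the maximum number of pending requests, defaulting to 50.
func maxQueue() int {
	if n, err := strconv.Atoi(os.Getenv("OLLAMA_MAX_QUEUE")); err == nil && n > 0 {
		return n
	}
	return 50 // new default, up from 10
}

func main() { fmt.Println("max pending requests:", maxQueue()) }
```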
Michael Yang b7a87a22b6
Merge pull request #4059 from ollama/mxyng/parser-2
rename parser to model/file
2024-05-03 13:01:22 -07:00
Daniel Hiltgen 9a32c514cb Soften timeouts on sched unit tests
This gives us more headroom on the scheduler tests to tamp
down some flakes.
2024-05-03 09:08:33 -07:00
Michael Yang e9ae607ece
Merge pull request #3892 from ollama/mxyng/parser
refactor modelfile parser
2024-05-02 17:04:47 -07:00
Michael Yang 5b806d8d24
Merge pull request #4089 from ollama/mxyng/target-invalid
server: destination invalid
2024-05-01 12:46:35 -07:00
Michael Yang 45b6a12e45 server: target invalid 2024-05-01 12:40:45 -07:00
Mark Ward 63c763685f log while waiting for the process to stop, to help debug when other tasks execute during the wait.
The expire timer clears its timer reference because it will not be reused.
Close will clean up expireTimer if the calling code has not already done so.
2024-05-01 18:51:10 +00:00
Mark Ward f4a73d57a4 fix runner expiring during active use: clear the expire timer while the runner is in use, and let the finish path assign a new expire timer so the runner expires only after a period of no use. 2024-05-01 18:51:10 +00:00
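These two commits describe a timer lifecycle: stop and clear the expire timer while the runner is serving, then arm a fresh one when the request finishes. A minimal sketch of that pattern with time.AfterFunc (names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type runner struct {
	mu          sync.Mutex
	expireTimer *time.Timer
}

// use cancels any pending expiration while the runner is actively serving.
func (r *runner) use() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.expireTimer != nil {
		r.expireTimer.Stop()
		r.expireTimer = nil // clear the reference; it will not be reused
	}
}

// finish arms a new expire timer so the runner unloads after a period of no use.
func (r *runner) finish(after time.Duration, unload func()) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.expireTimer = time.AfterFunc(after, unload)
}

func main() {
	r := &runner{}
	r.use() // active use: no expiration can fire here
	r.finish(50*time.Millisecond, func() { fmt.Println("runner expired") })
	time.Sleep(100 * time.Millisecond)
}
```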
Michael Yang 119589fcb3 rename parser to model/file 2024-05-01 09:53:50 -07:00
Michael Yang 9cf0f2e973 use parser.Format instead of templating modelfile 2024-05-01 09:52:54 -07:00
Michael Yang c0a00f68ae refactor modelfile parser 2024-05-01 09:52:54 -07:00
Bruce MacDonald 0a7fdbe533
prompt to display and add local ollama keys to account (#3717)
- return descriptive error messages when unauthorized to create blob or push a model
- display the local public key associated with the request that was denied
2024-04-30 11:02:08 -07:00
Jeffrey Morgan 586672f490
fix copying model to itself (#4019) 2024-04-28 23:47:49 -04:00
Daniel Hiltgen d6e3b64582 Fix concurrency for CPU mode
Prior refactoring passes accidentally removed the logic to bypass VRAM
checks for CPU loads.  This adds that back, along with test coverage.

This also fixes the loaded map access in the unit test to be behind the mutex; the
unguarded access was likely the cause of various flakes in the tests.
2024-04-28 13:42:39 -07:00
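The flake fix is the classic guarded-map pattern: every access to the loaded map, tests included, must hold the same mutex the scheduler uses. A generic sketch:

```go
package main

import (
	"fmt"
	"sync"
)

type scheduler struct {
	loadedMu sync.Mutex
	loaded   map[string]bool // model name -> currently loaded
}

// isLoaded reads the map under the mutex; unguarded reads from a test race
// with the scheduler goroutine and show up as flaky failures.
func (s *scheduler) isLoaded(model string) bool {
	s.loadedMu.Lock()
	defer s.loadedMu.Unlock()
	return s.loaded[model]
}

func main() {
	s := &scheduler{loaded: map[string]bool{"llama": true}}
	fmt.Println(s.isLoaded("llama"))
}
```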
Jeffrey Morgan bb31def011
return code 499 when user cancels request while a model is loading (#3955) 2024-04-26 17:38:29 -04:00
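499 is the nginx convention for "client closed request"; net/http defines no constant for it. A sketch of spotting the cancellation through the request context (hypothetical handler logic, not ollama's actual route):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

const statusClientClosedRequest = 499 // nginx convention; no net/http constant exists

// loadModel stands in for a long-running load; real code would check ctx
// periodically while it works.
func loadModel(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the user cancelling mid-load
	if err := loadModel(ctx); errors.Is(err, context.Canceled) {
		fmt.Println("respond with status", statusClientClosedRequest)
	}
}
```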
Blake Mizerany 37f9c8ad99
types/model: overhaul Name and Digest types (#3924) 2024-04-26 13:08:32 -07:00
Daniel Hiltgen 9b5a3c5991
Merge pull request #3914 from dhiltgen/mac_perf
Improve mac parallel performance
2024-04-25 16:28:31 -07:00
Jeffrey Morgan 00b0699c75
Reload model if num_gpu changes (#3920)
* reload model if `num_gpu` changes

* don't reload on -1

* fix tests
2024-04-25 19:02:40 -04:00
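A sketch of the reload decision those bullets describe: an explicit change to num_gpu forces a reload, while -1 (conventionally "auto" in llama.cpp-based stacks) does not. The helper and its semantics are assumptions:

```go
package main

import "fmt"

// needsReload reports whether a runner loaded with loadedGPU layers must be
// reloaded to satisfy a request asking for requestedGPU layers.
// -1 means "auto", so it never forces a reload on its own (assumed semantics).
func needsReload(loadedGPU, requestedGPU int) bool {
	if requestedGPU == -1 {
		return false
	}
	return loadedGPU != requestedGPU
}

func main() {
	fmt.Println(needsReload(33, 20)) // true: num_gpu changed explicitly
	fmt.Println(needsReload(33, -1)) // false: -1 leaves the loaded model alone
}
```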
Daniel Hiltgen b123be5b71 Adjust context size for parallelism 2024-04-25 13:58:54 -07:00
Daniel Hiltgen f503a848c2
Merge pull request #3895 from brycereitano/shiftloading
Move ggml loading to when attempting to fit
2024-04-25 09:24:08 -07:00
Bryce Reitano 36a6daccab Restructure loading conditional chain 2024-04-24 17:37:03 -06:00
Bryce Reitano ceb0e26e5e Provide variable ggml for TestLoad 2024-04-24 17:19:55 -06:00
Bryce Reitano 284e02bed0 Move ggml loading to when we attempt fitting 2024-04-24 17:17:24 -06:00
Michael Yang 592dae31c8 update copy to use model.Name 2024-04-24 15:54:54 -07:00
Daniel Hiltgen d8851cb7a0 Harden sched TestLoad
Give the goroutine a moment to deliver the expired event
2024-04-23 16:14:47 -07:00
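"Give the goroutine a moment" is typically a select on the event channel with a timeout rather than a fixed sleep; a sketch of that test pattern (not the literal TestLoad code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	expired := make(chan string, 1)

	// Stand-in for the scheduler goroutine delivering an expired event.
	go func() { expired <- "model-a" }()

	// Wait briefly for the event instead of failing the instant it is late.
	select {
	case name := <-expired:
		fmt.Println("got expired event for", name)
	case <-time.After(250 * time.Millisecond):
		fmt.Println("test would fail: no expired event delivered in time")
	}
}
```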
Daniel Hiltgen 34b9db5afc Request and model concurrency
This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The default
settings are currently set at 1 concurrent request per model and only 1
loaded model at a time, but these can be adjusted by setting
OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
2024-04-22 19:29:12 -07:00
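A sketch of reading the two knobs named above with their stated defaults of 1 each; the parsing helper is illustrative:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envInt reads an integer env var, falling back to def when unset or invalid.
func envInt(key string, def int) int {
	if n, err := strconv.Atoi(os.Getenv(key)); err == nil && n > 0 {
		return n
	}
	return def
}

func main() {
	numParallel := envInt("OLLAMA_NUM_PARALLEL", 1)    // concurrent requests per model
	maxLoaded := envInt("OLLAMA_MAX_LOADED_MODELS", 1) // models resident at once
	fmt.Println("parallel:", numParallel, "max loaded:", maxLoaded)
}
```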
Cheng 62be2050dd
chore: use errors.New instead of fmt.Errorf where no formatting is needed (#3789) 2024-04-20 22:11:06 -04:00
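This is the standard Go cleanup: fmt.Errorf without format verbs buys nothing over errors.New, which makes the static-message case explicit and cheaper. For example:

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("model not found")                 // preferred for a static message
var errWithContext = fmt.Errorf("model %q not found", "llama")  // fmt.Errorf only when formatting

func main() {
	fmt.Println(errNotFound, "|", errWithContext)
}
```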
Patrick Devine 9f8691c6c8
Add llama2 / torch models for ollama create (#3607) 2024-04-15 11:26:42 -07:00
Jeffrey Morgan a0b8a32eb4
Terminate subprocess if receiving SIGINT or SIGTERM signals while model is loading (#3653)
* terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading

* use `unload` in signal handler
2024-04-15 12:09:32 -04:00
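A minimal sketch of the handler described: trap SIGINT/SIGTERM and run the unload path before exiting, with a stand-in unload:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func unload() { /* terminate the model runner subprocess cleanly here */ }

func main() {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)

	go func() {
		s := <-sig
		fmt.Println("received", s, "- unloading runner subprocess")
		unload()
		os.Exit(0)
	}()

	fmt.Println("loading model... press Ctrl-C to interrupt")
	select {} // block forever, as a placeholder for the load
}
```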
Blake Mizerany a7b431e743
server: provide helpful workaround hint when stalling on pull (#3584)
This is a quick fix to help users who are stuck on the "pull" step at
99%.

In the near future we're introducing a new registry client that
should be smarter. In the meantime, this should unblock
the users hitting issue #1736.
2024-04-10 16:24:37 -07:00
Michael Yang 9502e5661f cgo quantize 2024-04-08 15:31:08 -07:00
Michael Yang e1c9a2a00f no blob create if already exists 2024-04-08 15:09:48 -07:00
Daniel Hiltgen 6589eb8a8c Revert options as a ref in the server 2024-04-02 16:44:10 -07:00
Daniel Hiltgen 58d95cc9bd Switch back to subprocessing for llama.cpp
This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process, shut it down when idle, and
gracefully restart it if it has problems. This also serves as a first step toward
running multiple copies to support multiple models concurrently.
2024-04-01 16:48:18 -07:00
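A sketch of the isolate-and-restart idea with os/exec; the binary path and restart policy are placeholders, not ollama's actual runner supervision:

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// superviseRunner keeps one llama.cpp server subprocess alive, restarting it
// if it exits with an error. Crashes stay contained in the child process.
func superviseRunner(path string, args ...string) {
	for {
		cmd := exec.Command(path, args...)
		if err := cmd.Start(); err != nil {
			log.Fatalf("failed to start runner: %v", err)
		}
		if err := cmd.Wait(); err == nil {
			return // clean exit, e.g. idle shutdown
		} else {
			log.Printf("runner exited: %v; restarting", err)
		}
		time.Sleep(time.Second) // simple backoff before the graceful restart
	}
}

func main() {
	superviseRunner("/usr/bin/true") // placeholder binary for illustration
}
```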
Patrick Devine 3b6a9154dd
Simplify model conversion (#3422) 2024-04-01 16:14:53 -07:00
Michael Yang 91b3e4d282 update memory calculations
count each layer independently when deciding GPU offloading
2024-04-01 13:16:32 -07:00
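Counting each layer independently means summing actual per-layer sizes against the free VRAM budget instead of dividing the model size evenly. A sketch under that assumption:

```go
package main

import "fmt"

// layersThatFit walks per-layer sizes in order and offloads layers to the GPU
// until the next layer would exceed the free VRAM budget.
func layersThatFit(layerSizes []uint64, freeVRAM uint64) int {
	var used uint64
	for i, size := range layerSizes {
		if used+size > freeVRAM {
			return i
		}
		used += size
	}
	return len(layerSizes)
}

func main() {
	// Layers are rarely uniform in size, which is why counting each one
	// independently beats an average-based estimate.
	sizes := []uint64{512 << 20, 256 << 20, 256 << 20, 512 << 20}
	fmt.Println("offloaded layers:", layersThatFit(sizes, 1<<30)) // 3 fit in 1 GiB
}
```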