Commit graph

42 commits

Author SHA1 Message Date
Jeffrey Morgan b24e8d17b2
Increase minimum CUDA memory allocation overhead and fix minimum overhead for multi-gpu (#1896)
* increase minimum cuda overhead and fix minimum overhead for multi-gpu

* fix multi gpu overhead

* limit overhead to 10% of all gpus

* better wording

* allocate fixed amount before layers

* fixed amount only includes graph alloc
2024-01-10 19:08:51 -05:00
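The overhead rule described in #1896 lends itself to a short sketch. Below is a minimal Go illustration, assuming hypothetical names (usableVRAM, graphAlloc) and a made-up 512 MiB floor; it is not the repository's actual code:

```go
package main

import "fmt"

// Hypothetical fixed floor for the reserved overhead; the value and the
// names below are assumptions, not Ollama's actual constants.
const minimumOverhead uint64 = 512 * 1024 * 1024

// usableVRAM reserves a fixed graph allocation up front, then caps the
// remaining overhead at 10% of the combined VRAM across all GPUs.
func usableVRAM(gpuVRAM []uint64, graphAlloc uint64) uint64 {
	var total uint64
	for _, v := range gpuVRAM {
		total += v
	}
	overhead := total / 10 // limit overhead to 10% of all GPUs
	if overhead < minimumOverhead {
		overhead = minimumOverhead
	}
	if graphAlloc+overhead >= total {
		return 0 // nothing left for layers
	}
	return total - graphAlloc - overhead
}

func main() {
	// Two 24 GiB GPUs with a 1 GiB graph allocation.
	fmt.Println(usableVRAM([]uint64{24 << 30, 24 << 30}, 1<<30))
}
```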
Jeffrey Morgan f387e9631b use runner if cuda alloc won't fit 2024-01-09 00:44:34 -05:00
Jeffrey Morgan cb534e6ac2 use 10% vram overhead for cuda 2024-01-08 23:17:44 -05:00
Jeffrey Morgan 58ce2d8273 better estimate scratch buffer size 2024-01-08 21:32:44 -05:00
Jeffrey Morgan 08f1e18965
Offload layers to GPU based on new model size estimates (#1850)
* select layers based on estimated model memory usage

* always account for scratch vram

* don't load +1 layers

* better estimation for graph alloc

* Update gpu/gpu_darwin.go

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

* Update llm/llm.go

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

* Update llm/llm.go

* add overhead for cuda memory

* Update llm/llm.go

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

* fix build error on linux

* address comments

---------

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2024-01-08 16:42:00 -05:00
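A hedged sketch of the selection rule from #1850 follows, with perLayer and scratch as assumed inputs rather than the repo's real estimates:

```go
package main

import "fmt"

// layersToOffload is an illustrative estimator in the spirit of #1850:
// scratch VRAM is reserved first, then whole layers are offloaded while
// they fit.
func layersToOffload(freeVRAM, perLayer, scratch uint64, nLayers int) int {
	if freeVRAM <= scratch || perLayer == 0 {
		return 0 // always account for scratch VRAM
	}
	n := int((freeVRAM - scratch) / perLayer)
	if n > nLayers {
		n = nLayers // never plan for more layers than the model has
	}
	return n
}

func main() {
	// 8 GiB free, ~300 MiB per layer, 1 GiB scratch, 33-layer model.
	fmt.Println(layersToOffload(8<<30, 300<<20, 1<<30, 33))
}
```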
Daniel Hiltgen e9ce91e9a6 Load dynamic cpu lib on windows
On Linux, we link the CPU library into the Go app and fall back to it
when no GPU match is found. On Windows we do not link in the CPU library,
so that we can better control our dependencies for the CLI. This fixes
the logic so we correctly fall back to the dynamic CPU library
on Windows.
2024-01-04 08:41:41 -08:00
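A small Go sketch of that fallback rule, with placeholder backend names:

```go
package main

import (
	"fmt"
	"runtime"
)

// pickBackend illustrates the fallback described above; the return
// strings are placeholders, not real library names. On Linux the CPU
// backend is linked into the binary; on Windows it is loaded dynamically.
func pickBackend(gpuFound bool) string {
	if gpuFound {
		return "gpu"
	}
	if runtime.GOOS == "windows" {
		return "dynamic CPU library"
	}
	return "built-in CPU library"
}

func main() { fmt.Println(pickBackend(false)) }
```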
Jeffrey Morgan c0285158a9 tweak memory requirements error text 2024-01-03 19:47:18 -05:00
Jeffrey Morgan 77a66df72c add macOS memory check for 47B models 2024-01-03 19:46:16 -05:00
Jeffrey Morgan 5b4837f881 remove unused filetype check 2024-01-03 19:45:39 -05:00
Daniel Hiltgen 7555ea44f8 Revamp the dynamic library shim
This switches the default llama.cpp to be CPU based, and builds the GPU variants
as dynamically loaded libraries which we can select at runtime.

This also bumps the ROCm library to version 6, since 5.7 builds don't work
on the latest ROCm library that just shipped.
2023-12-20 14:45:57 -08:00
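The runtime selection this describes might look roughly like the following; the variant names are illustrative only:

```go
package main

import "fmt"

// selectVariant mirrors the revamp described above: the CPU build is the
// default, and GPU variants are dynamically loaded shims chosen at runtime.
func selectVariant(hasCUDA, hasROCm bool) string {
	switch {
	case hasCUDA:
		return "cuda shim library"
	case hasROCm:
		return "rocm v6 shim library" // 5.7 builds don't work on latest ROCm
	default:
		return "cpu llama.cpp build" // the new default
	}
}

func main() { fmt.Println(selectVariant(true, false)) }
```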
Daniel Hiltgen 3269535a4c Refine handling of shim presence
This allows the CPU only builds to work on systems with Radeon cards
2023-12-19 09:05:46 -08:00
Daniel Hiltgen 35934b2e05 Adapted rocm support to cgo based llama.cpp 2023-12-19 09:05:46 -08:00
Daniel Hiltgen d4cd695759 Add cgo implementation for llama.cpp
Run the server.cpp directly inside the Go runtime via cgo
while retaining the LLM Go abstractions.
2023-12-19 09:05:46 -08:00
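A toy cgo example showing only the shape of the approach, with a stand-in C function rather than the real server.cpp binding:

```go
package main

// The C preamble below is a placeholder; the actual commit wraps
// server.cpp entry points while keeping the LLM Go abstractions.

/*
#include <stdio.h>

static void start_server(void) {
	printf("llama.cpp server entry point would run here\n");
}
*/
import "C"

func main() {
	C.start_server() // call into C directly from the Go runtime
}
```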
Bruce MacDonald 811b1f03c8 deprecate ggml
- remove ggml runner
- automatically pull gguf models when ggml detected
- tell users to update to gguf in case the automatic pull fails

Co-Authored-By: Jeffrey Morgan <jmorganca@gmail.com>
2023-12-19 09:05:46 -08:00
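The detection step could be sketched as below. The GGUF magic is the ASCII bytes "GGUF" and the legacy magics are llama.cpp constants, but treat this as an illustration rather than the repo's decoder:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

// detect sniffs the first four bytes of a model file and classifies it
// as GGUF or legacy GGML (ggml/ggmf/ggjt).
func detect(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	var magic [4]byte
	if _, err := io.ReadFull(f, magic[:]); err != nil {
		return "", err
	}
	if bytes.Equal(magic[:], []byte("GGUF")) {
		return "gguf", nil
	}
	switch binary.LittleEndian.Uint32(magic[:]) {
	case 0x67676d6c, 0x67676d66, 0x67676a74: // 'ggml', 'ggmf', 'ggjt'
		return "ggml", nil // would trigger the automatic GGUF pull
	}
	return "", fmt.Errorf("unrecognized model format")
}

func main() {
	kind, err := detect(os.Args[1])
	fmt.Println(kind, err)
}
```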
Michael Yang b9495ea162 load projectors 2023-12-05 14:36:12 -08:00
Bruce MacDonald 195e3d9dbd
chat api endpoint (#1392) 2023-12-05 14:57:33 -05:00
Jeffrey Morgan 00d06619a1 Revert "chat api (#991)" while the context variable issue is fixed
This reverts commit 7a0899d62d.
2023-12-04 21:16:27 -08:00
Bruce MacDonald 7a0899d62d
chat api (#991)
- update chat docs
- add messages chat endpoint
- remove deprecated context and template generate parameters from docs
- context and template are still supported for the time being and will continue to work as expected
- add partial response to chat history
2023-12-04 18:01:06 -05:00
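A minimal Go client for the new /api/chat endpoint; the model name is a placeholder and a local server on the default port is assumed:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Messages-based request shape added by #991.
	body := []byte(`{"model":"llama2","messages":[{"role":"user","content":"hello"}]}`)
	resp, err := http.Post("http://localhost:11434/api/chat",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // streamed JSON chunks by default
}
```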
Michael Yang 19b7a4d715 recent llama.cpp update added kernels for fp32, q5_0, and q5_1 2023-11-20 13:44:31 -08:00
Jeffrey Morgan 5cba29b9d6
JSON mode: add `"format"` as an API parameter (#1051)
* add `"format": "json"` as an API parameter
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2023-11-09 16:44:02 -08:00
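Example usage of the new parameter against /api/generate; model and prompt are placeholders:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// "format":"json" constrains the model's output to valid JSON.
	body := []byte(`{"model":"llama2","prompt":"List three colors as a JSON array.","format":"json","stream":false}`)
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```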
Jeffrey Morgan 2e53704685
default rope params to 0 for new models (#968) 2023-11-02 08:41:30 -07:00
Jeffrey Morgan 7ed5a39bc7 simpler check for model loading compatibility errors 2023-10-19 14:50:49 -04:00
Jeffrey Morgan a7dad24d92
add error for falcon and starcoder vocab compatibility (#844)
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2023-10-19 12:18:31 -04:00
Michael Yang 36fe2deebf only check system memory on macos 2023-10-13 14:47:29 -07:00
Michael Yang 4a8931f634 check total (system + video) memory 2023-10-13 14:47:29 -07:00
Michael Yang bd6e38fb1a refactor memory check 2023-10-13 14:47:29 -07:00
Michael Yang 92189a5855 fix memory check 2023-10-13 14:47:29 -07:00
Michael Yang b599946b74 add format bytes 2023-10-11 14:08:23 -07:00
Bruce MacDonald d06bc0cb6e
enable q8, q5, q5_1, and f32 for Linux GPU (#699) 2023-10-05 12:53:47 -04:00
Bruce MacDonald 86279f4ae3
unbound max num gpu layers (#591)
---------

Co-authored-by: Michael Yang <mxyng@pm.me>
2023-09-25 18:36:46 -04:00
Bruce MacDonald 4cba75efc5
remove tmp directories created by previous servers (#559)
* remove tmp directories created by previous servers

* clean up on server stop

* Update routes.go

* Update server/routes.go

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>

* create top-level temp ollama dir

* check file exists before creating

---------

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
2023-09-21 20:38:49 +01:00
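A sketch of the cleanup flow in #559, with an assumed path layout under the top-level temp ollama dir:

```go
package main

import (
	"os"
	"path/filepath"
)

// cleanupStaleTmp removes temp dirs left behind by previous servers,
// then recreates the top-level directory. The layout is an assumption.
func cleanupStaleTmp() error {
	root := filepath.Join(os.TempDir(), "ollama") // top-level temp ollama dir
	entries, err := os.ReadDir(root)
	if err != nil {
		if os.IsNotExist(err) {
			return os.MkdirAll(root, 0o755) // check existence before creating
		}
		return err
	}
	for _, e := range entries {
		_ = os.RemoveAll(filepath.Join(root, e.Name()))
	}
	return nil
}

func main() { _ = cleanupStaleTmp() }
```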
Michael Yang 7dee25a07f fix falcon decode
get model and file type from bin file
2023-09-12 12:34:53 -07:00
Bruce MacDonald 09dd2aeff9
GGUF support (#441) 2023-09-07 13:55:37 -04:00
Bruce MacDonald 42998d797d
subprocess llama.cpp server (#401)
* remove c code
* pack llama.cpp
* use request context for llama_cpp
* let llama_cpp decide the number of threads to use
* stop llama runner when app stops
* remove sample count and duration metrics
* use go generate to get libraries
* tmp dir for running llm
2023-08-30 16:35:03 -04:00
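The subprocess lifecycle from #401 can be sketched with exec.CommandContext, so the runner stops when the request context is canceled; the binary name and flags are placeholders:

```go
package main

import (
	"context"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	// Placeholder for the packed llama.cpp server binary and its flags.
	cmd := exec.CommandContext(ctx, "./llama-server", "--port", "8080")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	_ = cmd.Wait() // the process is killed when ctx is canceled
}
```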
Michael Yang b25dd1795d allow F16 to use metal
Warning: F16 uses significantly more memory than a quantized model, so
the standard requirements don't apply.
2023-08-26 08:38:48 -07:00
Michael Yang 304f2b6c96 add 34b to mem check 2023-08-26 08:29:21 -07:00
Michael Yang a894cc792d model and file type as strings 2023-08-17 12:08:04 -07:00
Michael Yang e26085b921 close open files 2023-08-14 16:08:06 -07:00
Michael Yang 6de5d032e1 implement loading ggml lora adapters through the modelfile 2023-08-10 09:23:39 -07:00
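A hypothetical Modelfile using the ADAPTER instruction this commit adds; the base model and adapter path are placeholders:

```
# Apply a GGML LoRA adapter on top of a base model.
FROM llama2
ADAPTER ./ggml-lora-adapter.bin
```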
Michael Yang d791df75dd check memory requirements before loading 2023-08-10 09:23:11 -07:00
Michael Yang 020a3b3530 disable gpu for q5_0, q5_1, q8_0 quants 2023-08-10 09:23:11 -07:00
Michael Yang fccf8d179f partial decode ggml bin for more info 2023-08-10 09:23:10 -07:00