Commit graph

44 commits

Author SHA1 Message Date
Michael Yang 7bd7b02712 make patches git am-able
Raw diffs can be applied with `git apply` but not with `git am`. Git patches, e.g. those produced by `git format-patch`, are both apply-able and am-able.
2024-09-17 15:26:40 -07:00
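A minimal illustration of the distinction this commit describes (file names here are hypothetical): a raw diff only works with `git apply`, while a mail-formatted patch from `git format-patch` works with `git apply` and `git am` alike.

```sh
# A raw diff applies to the working tree but carries no commit metadata,
# so `git am` has nothing to construct a commit from:
git apply changes.diff            # hypothetical file name

# `git format-patch` emits a mail-formatted patch that includes author,
# date, and message, so either command accepts it:
git format-patch -1 HEAD          # writes e.g. 0001-<subject>.patch
git apply 0001-*.patch            # apply the diff only, or ...
git am    0001-*.patch            # ... recreate the full commit
```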
Jeffrey Morgan 5e2653f9fe
llm: update llama.cpp commit to 8962422 (#6618) 2024-09-03 21:12:39 -04:00
Daniel Hiltgen 90ca84172c
Fix embeddings memory corruption (#6467)
* Fix embeddings memory corruption

The patch was leading to a buffer overrun corruption. Once removed, though, parallelism
in server.cpp led to hitting an assert due to slot/seq IDs being >= the token count. To
work around this, only use slot 0 for embeddings.

* Fix embed integration test assumption

The token eval count has changed with recent llama.cpp bumps (0.3.5+)
2024-08-22 14:51:42 -07:00
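A rough sketch of the workaround described above, assuming a server-style slot scheduler; the names are illustrative and this is not the actual server.cpp change.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical slot scheduler: embedding requests are pinned to slot 0 so the
// sequence id used for their batch never exceeds the small token count of an
// embedding prompt, avoiding the assert mentioned above.
struct slot { int id; bool busy; };

slot *pick_slot(std::vector<slot> &slots, bool is_embedding) {
    if (is_embedding) {
        return slots[0].busy ? nullptr : &slots[0];   // embeddings: slot 0 only
    }
    for (auto &s : slots) {
        if (!s.busy) return &s;                       // completions: any free slot
    }
    return nullptr;                                   // caller queues until a slot frees
}
```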
Jeffrey Morgan e04c7012c2
update llama.cpp submodule to 1e6f6554 (#6208) 2024-08-06 15:11:45 -04:00
Michael Yang 0f3271db88 patches: phi3 default sliding window attention 2024-07-31 14:58:34 -07:00
jmorganca afa8d6e9d5 patch gemma support 2024-07-30 18:07:29 -07:00
Jeffrey Morgan 68ee42f995
update llama.cpp submodule to 6eeaeba1 (#6039) 2024-07-29 13:20:26 -07:00
Jeffrey Morgan f2a96c7d77
llm: keep patch for llama 3 rope factors (#5987) 2024-07-26 15:20:52 -07:00
Jeffrey Morgan bbf8f102ee
Revert "llm(llama): pass rope factors (#5924)" (#5963)
This reverts commit bb46bbcf5e.
2024-07-25 18:24:55 -04:00
Michael Yang bb46bbcf5e
llm(llama): pass rope factors (#5924) 2024-07-24 16:05:59 -04:00
Jeffrey Morgan f8fedbda20
Update llama.cpp submodule commit to d94c6e0c (#5805) 2024-07-22 12:42:00 -04:00
Jeffrey Morgan 5534f2cc6a
llm: consider head_dim in llama arch (#5817) 2024-07-20 21:48:12 -04:00
Jeffrey Morgan 1475eab95f
add patch for tekken (#5807) 2024-07-20 13:41:21 -04:00
Jeffrey Morgan 571dc61955
Update llama.cpp submodule to a8db2a9c (#5530) 2024-07-07 13:03:09 -04:00
Jeffrey Morgan 8f8e736b13
update llama.cpp submodule to d7fd29f (#5475) 2024-07-05 13:25:58 -04:00
Jeffrey Morgan e9188e971a
Fix assert on small embedding inputs (#5491)
* Fix assert on small embedding inputs

* Update llm/patches/09-pooling.diff
2024-07-05 11:20:57 -04:00
Daniel Hiltgen 6298f49816 Fix clip model loading with unicode paths
On Windows, if the model dir contained unicode characters, clip models would
fail to load. This fixes the file name handling in clip.cpp to support
UTF-16 on Windows.
2024-07-03 12:46:36 -07:00
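A sketch of the kind of change this implies (an assumption about the approach, not the literal patch): convert the UTF-8 path to UTF-16 and open it with `_wfopen` on Windows.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Open a model file whose path may contain non-ASCII characters. On Windows the
// narrow fopen interprets the bytes in the ANSI code page, so convert the path
// to UTF-16 and use _wfopen instead; elsewhere UTF-8 paths work with plain fopen.
static FILE *open_model_file(const std::string &path_utf8) {
#ifdef _WIN32
    int n = MultiByteToWideChar(CP_UTF8, 0, path_utf8.c_str(), -1, nullptr, 0);
    std::wstring wpath(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, path_utf8.c_str(), -1, &wpath[0], n);
    return _wfopen(wpath.c_str(), L"rb");
#else
    return std::fopen(path_utf8.c_str(), "rb");
#endif
}
```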
Jeffrey Morgan 4d311eb731
llm: architecture patch (#5316) 2024-06-26 21:38:12 -07:00
Jeffrey Morgan 152fc202f5
llm: update llama.cpp commit to 7c26775 (#4896)
* llm: update llama.cpp submodule to `7c26775`

* disable `LLAMA_BLAS` for now

* `-DLLAMA_OPENMP=off`
2024-06-17 15:56:16 -04:00
Jeffrey Morgan ce0dc33cb8
llm: patch to fix qwen 2 temporarily on nvidia (#4897) 2024-06-06 23:14:33 -07:00
Jeffrey Morgan 22f5c12ced
Update llama.cpp submodule to 5921b8f0 (#4731)
* update llama.cpp submodule to `5921b8f089d3b7bda86aac5a66825df6a6c10603`

* add patch
2024-05-30 16:20:22 -07:00
Michael Yang 714adb8bd1
bump (#4597) 2024-05-23 14:16:26 -07:00
Daniel Hiltgen b37b496a12 Wire up load progress
This doesn't expose a UX yet, but wires up the initial server portion
of progress reporting during load.
2024-05-23 13:36:48 -07:00
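A minimal sketch of what such progress plumbing usually looks like, modeled on the shape of llama.cpp's `llama_progress_callback` (a float in [0, 1] plus a user-data pointer); this is an assumption for illustration, not the ollama code.

```cpp
#include <atomic>
#include <cstdio>

// Callback shape assumed here: return false to abort the load early.
using progress_cb = bool (*)(float progress, void *user_data);

static std::atomic<float> g_load_progress{0.0f};

static bool report_progress(float progress, void * /*user_data*/) {
    g_load_progress.store(progress);                 // polled by a status endpoint
    std::fprintf(stderr, "loading: %3.0f%%\r", progress * 100.0f);
    return true;                                     // keep loading
}
```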
Jeffrey Morgan 583c1f472c
update llama.cpp submodule to 614d3b9 (#4414) 2024-05-16 13:53:09 -07:00
Jeffrey Morgan 1b0e6c9c0e
Fix llava models not working after first request (#4164)
* fix llava models not working after first request

* individual requests only for llava models
2024-05-05 20:50:31 -07:00
Daniel Hiltgen 85801317d1 Fix clip log import 2024-04-26 09:43:46 -07:00
jmorganca ddf5c09a9b use matrix multiplication kernels in more cases 2024-04-25 13:58:54 -07:00
Daniel Hiltgen 0035e31af8 Bump to b2581 2024-04-02 11:53:07 -07:00
Daniel Hiltgen 43799532c1 Bump llama.cpp to b2474
The release just before ggml-cuda.cu refactoring
2024-03-23 09:54:56 +01:00
Michael Yang 291c663865 fix: clip memory leak 2024-03-14 13:12:42 -07:00
Jeffrey Morgan e72c567cfd
restore locale patch (#3091) 2024-03-12 22:08:13 -07:00
Bruce MacDonald b80661e8c7
relay load model errors to the client (#3065) 2024-03-11 16:48:27 -04:00
Jeffrey Morgan 369eda65f5
update llama.cpp submodule to ceca1ae (#3064) 2024-03-11 12:57:48 -07:00
Jeffrey Morgan 41b00b9856 fix 03-locale.diff 2024-03-10 16:21:05 -07:00
Jeffrey Morgan 908005d90b
patch: use default locale in wpm tokenizer (#3034) 2024-03-09 21:12:12 -08:00
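For context, "default locale" in C/C++ usually means passing an empty string to `setlocale`, which selects the user's environment locale rather than a hard-coded one; a minimal sketch of that idiom (assumed, not the diff itself):

```cpp
#include <clocale>

int main() {
    // "" selects the environment's default locale rather than a fixed name,
    // so locale-aware character handling in the tokenizer follows the host.
    std::setlocale(LC_ALL, "");
}
```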
Jeffrey Morgan 1ffb1e2874
update llama.cpp submodule to 77d1ac7 (#3030) 2024-03-09 15:55:34 -08:00
Jeffrey Morgan 0e4669b04f
update llama.cpp submodule to 6cdabe6 (#2999) 2024-03-08 00:26:20 -08:00
Jeffrey Morgan 21347e1ed6
update llama.cpp submodule to c29af7e (#2868) 2024-03-01 15:26:04 -08:00
Jeffrey Morgan 4613a080e7
update llama.cpp submodule to 66c1968f7 (#2618) 2024-02-20 17:42:31 -05:00
Daniel Hiltgen fc39a6cd7a Fix cuda leaks
This should resolve the problem where we don't fully unload from the GPU
when we go idle.
2024-02-18 18:37:20 -08:00
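An illustrative-only sketch of what "fully unload from the GPU when idle" entails, using plain CUDA runtime calls and a hypothetical allocation registry; the real fix lives in the llama.cpp integration.

```cpp
#include <cuda_runtime.h>
#include <vector>

static std::vector<void *> g_device_buffers;   // hypothetical registry of live allocations

// Called when the runner goes idle: free every outstanding device buffer and
// synchronize so the memory is actually released before reporting idle.
void unload_gpu_state() {
    for (void *buf : g_device_buffers) {
        cudaFree(buf);
    }
    g_device_buffers.clear();
    cudaDeviceSynchronize();
}
```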
Jeffrey Morgan 26b13fc33c
patch: always add token to cache_tokens (#2459) 2024-02-12 08:10:16 -08:00
Daniel Hiltgen de76b95dd4 Bump llama.cpp to b2081 2024-02-06 12:06:43 -08:00
Daniel Hiltgen 72b12c3be7 Bump llama.cpp to b1999
This requires an upstream change to support graceful termination,
carried as a patch.
2024-01-30 16:52:12 -08:00
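"Graceful termination" here generally means draining in-flight work before exit rather than dying mid-request; a hedged sketch of the common pattern (not the carried patch):

```cpp
#include <atomic>
#include <csignal>

static std::atomic<bool> g_shutdown{false};

static void handle_signal(int) { g_shutdown.store(true); }

int main() {
    std::signal(SIGINT,  handle_signal);
    std::signal(SIGTERM, handle_signal);
    while (!g_shutdown.load()) {
        // serve one request / run one decode step
    }
    // drain: finish current responses, free the model, then exit cleanly
    return 0;
}
```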
Jeffrey Morgan a64570dcae
Fix clearing kv cache between requests with the same prompt (#2186)
* Fix clearing kv cache between requests with the same prompt

* fix powershell script
2024-01-25 13:46:20 -08:00
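A hedged sketch of the prompt-reuse bookkeeping this fix concerns: keep the KV-cache entries for the prefix the new prompt shares with the previous one and evict only the tail, so an identical prompt is not recomputed from scratch. The helper below is illustrative, not ollama's code.

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Number of leading tokens the cached prompt and the new prompt have in common.
// The caller would evict cache positions [keep, end) and decode only the
// remaining prompt tokens from position `keep` onwards.
size_t common_prefix(const std::vector<int> &cached, const std::vector<int> &prompt) {
    const size_t n = std::min(cached.size(), prompt.size());
    size_t keep = 0;
    while (keep < n && cached[keep] == prompt[keep]) {
        keep++;
    }
    return keep;
}
```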