Commit graph

370 commits

Author SHA1 Message Date
Michael Yang a8b9b930b4 account for all non-repeating layers 2024-04-17 11:21:21 -07:00
Michael Yang 26df674785 scale graph based on gpu count 2024-04-16 14:44:13 -07:00
Jeffrey Morgan 7c9792a6e0
Support unicode characters in model path (#3681)
* parse wide argv characters on windows

* cleanup

* move cleanup to end of `main`
2024-04-16 17:00:12 -04:00
Michael Yang 41a272de9f darwin: no partial offloading if required memory greater than system 2024-04-16 11:22:38 -07:00
Jeffrey Morgan f335722275
update llama.cpp submodule to 7593639 (#3665) 2024-04-15 23:04:43 -04:00
Michael Yang 6d53b67c2c
Merge pull request #3663 from ollama/mxyng/fix-padding 2024-04-15 17:44:54 -07:00
Michael Yang 969238b19e fix padding in decode
TODO: update padding() to _only_ returning the padding
2024-04-15 17:27:06 -07:00
Patrick Devine 9f8691c6c8
Add llama2 / torch models for ollama create (#3607) 2024-04-15 11:26:42 -07:00
Jeffrey Morgan a0b8a32eb4
Terminate subprocess if receiving SIGINT or SIGTERM signals while model is loading (#3653)
* terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading

* use `unload` in signal handler
2024-04-15 12:09:32 -04:00
Jeffrey Morgan 309aef7fee
update llama.cpp submodule to 4bd0f93 (#3627) 2024-04-13 10:43:02 -07:00
Michael Yang 3397eff0cd mixtral mem 2024-04-11 11:10:41 -07:00
Michael Yang 7e33a017c0 partial offloading 2024-04-10 11:37:20 -07:00
Michael Yang 8b2c10061c refactor tensor query 2024-04-10 11:37:20 -07:00
Daniel Hiltgen c5ff443b9f Handle very slow model loads
During testing, we're seeing some models take over 3 minutes.
2024-04-09 16:35:10 -07:00
Blake Mizerany 1524f323a3
Revert "build.go: introduce a friendlier way to build Ollama (#3548)" (#3564) 2024-04-09 15:57:45 -07:00
Blake Mizerany fccf3eecaa
build.go: introduce a friendlier way to build Ollama (#3548)
This commit introduces a more friendly way to build Ollama dependencies
and the binary without abusing `go generate` and removing the
unnecessary extra steps it brings with it.

This script also provides nicer feedback to the user about what is
happening during the build process.

At the end, it prints a helpful message to the user about what to do
next (e.g. run the new local Ollama).
2024-04-09 14:18:47 -07:00
Michael Yang c77d45d836
Merge pull request #3506 from ollama/mxyng/quantize-redux
cgo quantize
2024-04-09 12:32:53 -07:00
Jeffrey Morgan 5ec12cec6c
update llama.cpp submodule to 1b67731 (#3561) 2024-04-09 15:10:17 -04:00
Michael Yang 9502e5661f cgo quantize 2024-04-08 15:31:08 -07:00
Jeffrey Morgan 63efa075a0
update generate scripts with new LLAMA_CUDA variable, set HIP_PLATFORM to avoid compiler errors (#3528) 2024-04-07 19:29:51 -04:00
Michael Yang be517e491c no rope parameters 2024-04-05 18:05:27 -07:00
Michael Yang fc8e108642
Merge pull request #3496 from ollama/mxyng/cmd-r-graph
add command-r graph estimate
2024-04-05 12:26:21 -07:00
Daniel Hiltgen dfe330fa1c
Merge pull request #3488 from mofanke/fix-windows-dll-compress
fix dll compress in windows building
2024-04-04 16:12:13 -07:00
Michael Yang 01f77ae25d add command-r graph estimate 2024-04-04 14:07:24 -07:00
Daniel Hiltgen 36bd967722 Fail fast if mingw missing on windows 2024-04-04 09:51:26 -07:00
mofanke 4de0126719 fix dll compress in windows building 2024-04-04 21:27:33 +08:00
Daniel Hiltgen e4a7e5b2ca Fix CI release glitches
The subprocess change moved the build directory
arm64 builds weren't setting cross-compilation flags when building on x86
2024-04-03 16:41:40 -07:00
Michael Yang 12e923e158 update graph size estimate 2024-04-03 13:34:12 -07:00
Jeffrey Morgan cd135317d2
Fix macOS builds on older SDKs (#3467) 2024-04-03 10:45:54 -07:00
Michael Yang 4f895d633f
Merge pull request #3466 from ollama/mxyng/head-kv
default head_kv to 1
2024-04-03 10:41:00 -07:00
Daniel Hiltgen 464d817824
Merge pull request #3464 from dhiltgen/subprocess
Fix numgpu opt miscomparison
2024-04-02 20:10:17 -07:00
Daniel Hiltgen 6589eb8a8c Revert options as a ref in the server 2024-04-02 16:44:10 -07:00
Michael Yang 90f071c658 default head_kv to 1 2024-04-02 16:37:59 -07:00
Michael Yang 80163ebcb5 fix metal gpu 2024-04-02 16:06:45 -07:00
Daniel Hiltgen 0035e31af8 Bump to b2581 2024-04-02 11:53:07 -07:00
Daniel Hiltgen 0a0e9f3e0f Apply 01-cache.diff 2024-04-01 16:48:18 -07:00
Daniel Hiltgen 58d95cc9bd Switch back to subprocessing for llama.cpp
This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process and shutdown when idle, and
gracefully restart if it has problems.  This also serves as a first step to be
able to run multiple copies to support multiple models concurrently.
2024-04-01 16:48:18 -07:00
Michael Yang 91b3e4d282 update memory calcualtions
count each layer independently when deciding gpu offloading
2024-04-01 13:16:32 -07:00
Michael Yang d338d70492 refactor model parsing 2024-04-01 13:16:15 -07:00
Patrick Devine 5a5efee46b
Add gemma safetensors conversion (#3250)
Co-authored-by: Michael Yang <mxyng@pm.me>
2024-03-28 18:54:01 -07:00
Jeffrey Morgan f5ca7f8c8e
add license in file header for vendored llama.cpp code (#3351) 2024-03-26 16:23:23 -04:00
Jeffrey Morgan 856b8ec131
remove need for $VSINSTALLDIR since build will fail if ninja cannot be found (#3350) 2024-03-26 16:23:16 -04:00
Patrick Devine 1b272d5bcd
change github.com/jmorganca/ollama to github.com/ollama/ollama (#3347) 2024-03-26 13:04:17 -07:00
Daniel Hiltgen 8091ef2eeb Bump llama.cpp to b2527 2024-03-25 13:47:44 -07:00
Daniel Hiltgen 560be5e0b6
Merge pull request #3308 from dhiltgen/bump_more
Bump llama.cpp to b2510
2024-03-25 12:56:12 -07:00
Jeremy dfc6721b20 add support for libcudart.so for CUDA devices (adds Jetson support) 2024-03-25 11:07:44 -04:00
Blake Mizerany acfa2b9422
llm: prevent race appending to slice (#3320) 2024-03-24 11:35:54 -07:00
Daniel Hiltgen 3e30c75f3e Bump llama.cpp to b2510 2024-03-23 19:55:56 +01:00
Daniel Hiltgen 43799532c1 Bump llama.cpp to b2474
The release just before ggml-cuda.cu refactoring
2024-03-23 09:54:56 +01:00
Daniel Hiltgen 74788b487c Better tmpdir cleanup
If expanding the runners fails, don't leave a corrupt/incomplete payloads dir
We now write a pid file out to the tmpdir, which allows us to scan for stale tmpdirs
and remove this as long as there isn't still a process running.
2024-03-20 16:03:19 +01:00