Commit graph

411 commits

Author SHA1 Message Date
Jeffrey Morgan 309aef7fee
update llama.cpp submodule to 4bd0f93 (#3627) 2024-04-13 10:43:02 -07:00
Michael Yang 3397eff0cd mixtral mem 2024-04-11 11:10:41 -07:00
Michael Yang 7e33a017c0 partial offloading 2024-04-10 11:37:20 -07:00
Michael Yang 8b2c10061c refactor tensor query 2024-04-10 11:37:20 -07:00
Daniel Hiltgen c5ff443b9f Handle very slow model loads
During testing, we're seeing some models take over 3 minutes.
2024-04-09 16:35:10 -07:00
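A hedged sketch of what handling multi-minute loads can look like: poll the runner until it responds healthy, with a deadline far more generous than a typical HTTP timeout. The function name, endpoint, and 10-minute figure are illustrative assumptions, not Ollama's actual API.

```go
package llm

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// waitForModelLoad polls a (hypothetical) health endpoint until the model
// is ready. Since some models take over 3 minutes to load, the deadline
// here is deliberately generous.
func waitForModelLoad(ctx context.Context, healthURL string) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Minute)
	defer cancel()

	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("model load timed out: %w", ctx.Err())
		case <-ticker.C:
			req, _ := http.NewRequestWithContext(ctx, http.MethodGet, healthURL, nil)
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				continue // server process not accepting connections yet
			}
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // model is loaded and serving
			}
		}
	}
}
```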
Blake Mizerany 1524f323a3
Revert "build.go: introduce a friendlier way to build Ollama (#3548)" (#3564) 2024-04-09 15:57:45 -07:00
Blake Mizerany fccf3eecaa
build.go: introduce a friendlier way to build Ollama (#3548)
This commit introduces a friendlier way to build Ollama dependencies
and the binary without abusing `go generate`, removing the
unnecessary extra steps it brings with it.

This script also provides nicer feedback to the user about what is
happening during the build process.

At the end, it prints a helpful message to the user about what to do
next (e.g. run the new local Ollama).
2024-04-09 14:18:47 -07:00
Michael Yang c77d45d836
Merge pull request #3506 from ollama/mxyng/quantize-redux
cgo quantize
2024-04-09 12:32:53 -07:00
Jeffrey Morgan 5ec12cec6c
update llama.cpp submodule to 1b67731 (#3561) 2024-04-09 15:10:17 -04:00
Michael Yang 9502e5661f cgo quantize 2024-04-08 15:31:08 -07:00
Jeffrey Morgan 63efa075a0
update generate scripts with new LLAMA_CUDA variable, set HIP_PLATFORM to avoid compiler errors (#3528) 2024-04-07 19:29:51 -04:00
Michael Yang be517e491c no rope parameters 2024-04-05 18:05:27 -07:00
Michael Yang fc8e108642
Merge pull request #3496 from ollama/mxyng/cmd-r-graph
add command-r graph estimate
2024-04-05 12:26:21 -07:00
Daniel Hiltgen dfe330fa1c
Merge pull request #3488 from mofanke/fix-windows-dll-compress
fix dll compress in windows building
2024-04-04 16:12:13 -07:00
Michael Yang 01f77ae25d add command-r graph estimate 2024-04-04 14:07:24 -07:00
Daniel Hiltgen 36bd967722 Fail fast if mingw missing on windows 2024-04-04 09:51:26 -07:00
mofanke 4de0126719 fix dll compress in windows building 2024-04-04 21:27:33 +08:00
Daniel Hiltgen e4a7e5b2ca Fix CI release glitches
The subprocess change moved the build directory
arm64 builds weren't setting cross-compilation flags when building on x86
2024-04-03 16:41:40 -07:00
Michael Yang 12e923e158 update graph size estimate 2024-04-03 13:34:12 -07:00
Jeffrey Morgan cd135317d2
Fix macOS builds on older SDKs (#3467) 2024-04-03 10:45:54 -07:00
Michael Yang 4f895d633f
Merge pull request #3466 from ollama/mxyng/head-kv
default head_kv to 1
2024-04-03 10:41:00 -07:00
Daniel Hiltgen 464d817824
Merge pull request #3464 from dhiltgen/subprocess
Fix numgpu opt miscomparison
2024-04-02 20:10:17 -07:00
Daniel Hiltgen 6589eb8a8c Revert options as a ref in the server 2024-04-02 16:44:10 -07:00
Michael Yang 90f071c658 default head_kv to 1 2024-04-02 16:37:59 -07:00
Michael Yang 80163ebcb5 fix metal gpu 2024-04-02 16:06:45 -07:00
Daniel Hiltgen 0035e31af8 Bump to b2581 2024-04-02 11:53:07 -07:00
Daniel Hiltgen 0a0e9f3e0f Apply 01-cache.diff 2024-04-01 16:48:18 -07:00
Daniel Hiltgen 58d95cc9bd Switch back to subprocessing for llama.cpp
This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process, shut it down when idle, and
gracefully restart it if it has problems.  This also serves as a first step
toward running multiple copies to support multiple models concurrently.
2024-04-01 16:48:18 -07:00
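As a rough sketch of the shape this approach takes (names are illustrative, and locking and deliberate-shutdown tracking are omitted for brevity):

```go
package runner

import (
	"log"
	"os/exec"
	"time"
)

// Runner supervises a llama.cpp server running as a child process, so a
// crash or memory leak is isolated from the main Ollama server.
type Runner struct {
	cmd      *exec.Cmd
	lastUsed time.Time
}

// start launches the subprocess and restarts it if it exits unexpectedly.
// A real implementation would distinguish a deliberate shutdown from a
// crash before restarting.
func (r *Runner) start(binary string, args ...string) error {
	r.cmd = exec.Command(binary, args...)
	if err := r.cmd.Start(); err != nil {
		return err
	}
	r.lastUsed = time.Now()
	go func() {
		if err := r.cmd.Wait(); err != nil {
			log.Printf("runner exited: %v; restarting", err)
			r.start(binary, args...)
		}
	}()
	return nil
}

// reapIfIdle shuts the subprocess down once it has been idle long enough,
// releasing the memory llama.cpp was holding.
func (r *Runner) reapIfIdle(idle time.Duration) {
	if r.cmd != nil && time.Since(r.lastUsed) > idle {
		r.cmd.Process.Kill()
		r.cmd = nil
	}
}
```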
Michael Yang 91b3e4d282 update memory calculations
count each layer independently when deciding gpu offloading
2024-04-01 13:16:32 -07:00
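In sketch form, counting each layer independently means walking the layer list and offloading until the VRAM budget runs out. The types and field names below are assumptions for illustration, not Ollama's actual code:

```go
package llm

// Layer describes one transformer layer's memory footprint.
type Layer struct {
	Size uint64 // bytes required by this layer's tensors
}

// countOffloadLayers returns how many layers fit on the GPU, counting
// each layer independently until the VRAM budget is exhausted.
func countOffloadLayers(layers []Layer, freeVRAM uint64) int {
	var used uint64
	n := 0
	for _, l := range layers {
		if used+l.Size > freeVRAM {
			break
		}
		used += l.Size
		n++
	}
	return n
}
```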
Michael Yang d338d70492 refactor model parsing 2024-04-01 13:16:15 -07:00
Patrick Devine 5a5efee46b
Add gemma safetensors conversion (#3250)
Co-authored-by: Michael Yang <mxyng@pm.me>
2024-03-28 18:54:01 -07:00
Jeffrey Morgan f5ca7f8c8e
add license in file header for vendored llama.cpp code (#3351) 2024-03-26 16:23:23 -04:00
Jeffrey Morgan 856b8ec131
remove need for $VSINSTALLDIR since build will fail if ninja cannot be found (#3350) 2024-03-26 16:23:16 -04:00
Patrick Devine 1b272d5bcd
change github.com/jmorganca/ollama to github.com/ollama/ollama (#3347) 2024-03-26 13:04:17 -07:00
Daniel Hiltgen 8091ef2eeb Bump llama.cpp to b2527 2024-03-25 13:47:44 -07:00
Daniel Hiltgen 560be5e0b6
Merge pull request #3308 from dhiltgen/bump_more
Bump llama.cpp to b2510
2024-03-25 12:56:12 -07:00
Jeremy dfc6721b20 add support for libcudart.so for CUDA devices (adds Jetson support) 2024-03-25 11:07:44 -04:00
Blake Mizerany acfa2b9422
llm: prevent race appending to slice (#3320) 2024-03-24 11:35:54 -07:00
Daniel Hiltgen 3e30c75f3e Bump llama.cpp to b2510 2024-03-23 19:55:56 +01:00
Daniel Hiltgen 43799532c1 Bump llama.cpp to b2474
The release just before ggml-cuda.cu refactoring
2024-03-23 09:54:56 +01:00
Daniel Hiltgen 74788b487c Better tmpdir cleanup
If expanding the runners fails, don't leave a corrupt/incomplete payloads dir.
We now write a pid file out to the tmpdir, which allows us to scan for stale tmpdirs
and remove them as long as there isn't still a process running.
2024-03-20 16:03:19 +01:00
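A sketch of the pid-file scan described above: each run records its pid in its payload dir, and on startup old dirs are removed unless the recorded process is still alive. The paths and file name are illustrative assumptions:

```go
package llm

import (
	"os"
	"path/filepath"
	"strconv"
	"syscall"
)

// writePidFile records this process's pid in its payload dir.
func writePidFile(dir string) error {
	pid := strconv.Itoa(os.Getpid())
	return os.WriteFile(filepath.Join(dir, "ollama.pid"), []byte(pid), 0o644)
}

// cleanupStaleDirs removes tmpdirs whose recorded pid no longer refers to
// a running process (Unix semantics: signal 0 probes for existence).
func cleanupStaleDirs(root string) error {
	entries, err := os.ReadDir(root)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		dir := filepath.Join(root, e.Name())
		raw, err := os.ReadFile(filepath.Join(dir, "ollama.pid"))
		if err != nil {
			continue // no pid file; leave the dir alone
		}
		pid, err := strconv.Atoi(string(raw))
		if err == nil {
			if proc, err := os.FindProcess(pid); err == nil {
				if proc.Signal(syscall.Signal(0)) == nil {
					continue // still running, keep its dir
				}
			}
		}
		os.RemoveAll(dir) // stale: owner is gone
	}
	return nil
}
```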
Michael Yang 3c4ad0ecab dyn global 2024-03-18 09:45:45 +01:00
Michael Yang 22f326464e
Merge pull request #3083 from ollama/mxyng/refactor-readseeker
refactor readseeker
2024-03-16 12:08:56 -07:00
Jeffrey Morgan e95ffc7448
llama: remove server static assets (#3174) 2024-03-15 19:24:12 -07:00
Daniel Hiltgen ab3456207b
Merge pull request #3028 from ollama/ci_release
CI release process
2024-03-15 16:40:54 -07:00
Daniel Hiltgen 6ad414f31e
Merge pull request #3086 from dhiltgen/import_server
Import server.cpp to retain llava support
2024-03-15 16:10:35 -07:00
Daniel Hiltgen d4c10df2b0 Add Radeon gfx940-942 GPU support 2024-03-15 15:34:58 -07:00
Daniel Hiltgen 540f4af45f Wire up more complete CI for releases
Flesh out our GitHub Actions CI so we can build official releases.
2024-03-15 12:37:36 -07:00
Blake Mizerany 6ce37e4d96
llm,readline: use errors.Is instead of simple == check (#3161)
This fixes some brittle, simple equality checks to use errors.Is. Since
go1.13, errors.Is is the idiomatic way to check for errors.

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
2024-03-15 07:14:12 -07:00
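The change in miniature: comparing errors with `==` breaks once errors are wrapped, while `errors.Is` walks the wrap chain.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

func main() {
	// Wrapping io.EOF, as any layer in between might do.
	err := fmt.Errorf("read prompt: %w", io.EOF)

	fmt.Println(err == io.EOF)          // false: brittle equality check
	fmt.Println(errors.Is(err, io.EOF)) // true: idiomatic since Go 1.13
}
```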
Michael Yang 291c663865 fix: clip memory leak 2024-03-14 13:12:42 -07:00
Jeffrey Morgan e72c567cfd
restore locale patch (#3091) 2024-03-12 22:08:13 -07:00
Bruce MacDonald 3e22611200
token repeat limit for prediction requests (#3080) 2024-03-12 22:08:25 -04:00
Bruce MacDonald 2f804068bd
warn when json format is expected but not mentioned in prompt (#3081) 2024-03-12 19:07:11 -04:00
Daniel Hiltgen 85129d3a32 Adapt our build for imported server.cpp 2024-03-12 14:57:15 -07:00
Daniel Hiltgen 9ac6440da3 Import server.cpp as of b2356 2024-03-12 13:58:06 -07:00
Michael Yang 0085297928 refactor readseeker 2024-03-12 12:54:18 -07:00
racerole 53c107e20e
chore: fix typo (#3073)
Signed-off-by: racerole <jiangyifeng@outlook.com>
2024-03-12 14:09:22 -04:00
Bruce MacDonald b80661e8c7
relay load model errors to the client (#3065) 2024-03-11 16:48:27 -04:00
Jeffrey Morgan 369eda65f5
update llama.cpp submodule to ceca1ae (#3064) 2024-03-11 12:57:48 -07:00
Daniel Hiltgen bc13da2bfe Avoid rocm runner and dependency clash
Putting the rocm symlink next to the runners is risky.  This moves
the payloads into a subdir to avoid potential clashes.
2024-03-11 09:33:22 -07:00
Jeffrey Morgan 41b00b9856 fix 03-locale.diff 2024-03-10 16:21:05 -07:00
Daniel Hiltgen 3dc1bb6a35 Harden for deps file being empty (or short) 2024-03-10 14:45:38 -07:00
Jeffrey Morgan 908005d90b
patch: use default locale in wpm tokenizer (#3034) 2024-03-09 21:12:12 -08:00
Jeffrey Morgan e11668aa07 add bundle_metal and cleanup_metal functions to gen_darwin.sh 2024-03-09 16:04:57 -08:00
Jeffrey Morgan 1ffb1e2874
update llama.cpp submodule to 77d1ac7 (#3030) 2024-03-09 15:55:34 -08:00
Jeffrey Morgan f9cd55c70b disable gpu for certain model architectures and fix divide-by-zero on memory estimation 2024-03-09 12:51:38 -08:00
Daniel Hiltgen 4a5c9b8035 Finish unwinding idempotent payload logic
The recent ROCm change partially removed idempotent
payloads, but the ggml-metal.metal file for Mac was still
idempotent.  This finishes switching to always extract
the payloads, and now that idempotency is gone, the
version directory is no longer useful.
2024-03-09 08:34:39 -08:00
Jeffrey Morgan efe5617b64
update llama.cpp submodule to c2101a2 (#3020) 2024-03-09 00:44:50 -08:00
Michael Yang 76bdebbadf decode ggla 2024-03-08 15:46:25 -08:00
Jeffrey Morgan 0e4669b04f
update llama.cpp submodule to 6cdabe6 (#2999) 2024-03-08 00:26:20 -08:00
Daniel Hiltgen 6c5ccb11f9 Revamp ROCm support
This refines where we extract the LLM libraries to by adding a new
OLLAMA_HOME env var that defaults to `~/.ollama`.  The logic was already
idempotent, so this should speed up startups after the first time a
new release is deployed.  It also cleans up after itself.

We now build only a single ROCm version (latest major) on both Windows
and Linux.  Given the large size of ROCm's tensor files, we split the
dependency out.  It's bundled into the installer on Windows and a
separate download on Linux.  The Linux install script is now smart: it
detects the presence of AMD GPUs, looks to see if ROCm v6 is already
present, and, if not, downloads our dependency tar file.

For Linux discovery, we now use sysfs and check each GPU against what
ROCm supports so we can degrade to CPU gracefully instead of having
llama.cpp+ROCm assert/crash on us.  For Windows, we now use Go's Windows
dynamic library loading logic to access the amdhip64.dll APIs to query
the GPU information.
2024-03-07 10:36:50 -08:00
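On the Windows side, the dynamic-loading technique can be sketched roughly as below. hipGetDeviceCount(int*) is a real HIP API and amdhip64.dll is AMD's HIP runtime, but the package, function name, and surrounding wiring are simplified assumptions, not Ollama's actual gpu package:

```go
//go:build windows

package gpu

import (
	"fmt"
	"syscall"
	"unsafe"

	"golang.org/x/sys/windows"
)

// amdGPUCount loads amdhip64.dll at runtime and calls hipGetDeviceCount
// through it, avoiding any link-time dependency on the HIP runtime.
func amdGPUCount() (int, error) {
	dll, err := windows.LoadLibrary("amdhip64.dll")
	if err != nil {
		return 0, fmt.Errorf("no AMD HIP runtime found: %w", err)
	}
	defer windows.FreeLibrary(dll)

	proc, err := windows.GetProcAddress(dll, "hipGetDeviceCount")
	if err != nil {
		return 0, err
	}

	var count int32
	ret, _, _ := syscall.SyscallN(proc, uintptr(unsafe.Pointer(&count)))
	if ret != 0 { // hipSuccess == 0
		return 0, fmt.Errorf("hipGetDeviceCount failed: %d", ret)
	}
	return int(count), nil
}
```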
John 23ebe8fe11
fix some typos (#2973)
Signed-off-by: hishope <csqiye@126.com>
2024-03-06 22:50:11 -08:00
Patrick Devine 2c017ca441
Convert Safetensors to an Ollama model (#2824) 2024-03-06 21:01:51 -08:00
Jeffrey Morgan 21347e1ed6
update llama.cpp submodule to c29af7e (#2868) 2024-03-01 15:26:04 -08:00
Daniel Hiltgen bd1d8b0d14
Merge pull request #2836 from bmwiedemann/gzip
Omit build date from gzip headers
2024-02-29 15:46:46 -08:00
Jeffrey Morgan cbf4970e0f
bump submodule to 87c91c07663b707e831c59ec373b5e665ff9d64a (#2828) 2024-02-29 09:42:08 -08:00
Bernhard M. Wiedemann 76e5d9ec88 Omit build date from gzip headers
See https://reproducible-builds.org/ for why this is good.

This patch was done while working on reproducible builds for openSUSE.
2024-02-29 16:48:19 +01:00
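The mechanism in miniature: Go's gzip writer embeds a modification time in the stream header, so leaving ModTime at its zero value is what makes rebuilds byte-identical. A minimal sketch:

```go
package main

import (
	"compress/gzip"
	"log"
	"os"
)

func main() {
	f, err := os.Create("payload.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	zw := gzip.NewWriter(f)
	// Deliberately leave zw.ModTime at its zero value: gzip then writes a
	// zeroed MTIME field, so rebuilding the same input yields identical
	// bytes. Setting it to time.Now() would break reproducibility.
	if _, err := zw.Write([]byte("payload contents")); err != nil {
		log.Fatal(err)
	}
	zw.Close()
}
```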
Daniel Hiltgen 061e8f6abc Bump llama.cpp to b2276 2024-02-26 16:49:24 -08:00
Jeffrey Morgan 11bfff8ee1 update llama.cpp submodule to 96633eeca1265ed03e57230de54032041c58f9cd 2024-02-22 16:44:26 -05:00
Jeffrey Morgan efe040f8c0
reset with init_vars ahead of each cpu build in gen_windows.ps1 (#2654) 2024-02-21 16:35:34 -05:00
Jeffrey Morgan 2a7553ce09 update llama.cpp submodule to c14f72d 2024-02-21 09:03:14 -05:00
Jeffrey Morgan b3eac61cac update llama.cpp submodule to f0d1fafc029a056cd765bdae58dcaa12312e9879 2024-02-20 22:56:51 -05:00
Michael Yang 949d7b1c48
add gguf file types (#2532) 2024-02-20 19:06:29 -05:00
Jeffrey Morgan 4613a080e7
update llama.cpp submodule to 66c1968f7 (#2618) 2024-02-20 17:42:31 -05:00
Taras Tsugrii 01ff2e14db
[nit] Remove unused msg local var. (#2511) 2024-02-20 14:02:34 -05:00
Daniel Hiltgen 4fcbf1cde6
Merge pull request #2599 from dhiltgen/fix_avx
Explicitly disable AVX2 on GPU builds
2024-02-19 13:13:05 -08:00
Daniel Hiltgen 9220b4fa91
Merge pull request #2585 from dhiltgen/cuda_leaks
Fix cuda leaks
2024-02-19 12:48:00 -08:00
Daniel Hiltgen fc39a6cd7a Fix cuda leaks
This should resolve the problem where we don't fully unload from the GPU
when we go idle.
2024-02-18 18:37:20 -08:00
Daniel Hiltgen df6dc4fd96 Fix duplicate menus on update and exit on signals
Also fixes a few fit-and-finish items for better developer experience
2024-02-16 15:33:16 -08:00
Daniel Hiltgen db2a9ad1fe Explicitly disable AVX2 on GPU builds
Even though we weren't setting it to on, somewhere in the cmake config
it was getting toggled on.  By explicitly setting it to off, we get `/arch:AVX`
as intended.
2024-02-15 14:50:11 -08:00
Daniel Hiltgen 29e90cc13b Implement new Go based Desktop app
This focuses on Windows first, but could be used for Mac
and possibly Linux in the future.
2024-02-15 05:56:45 +00:00
Jeffrey Morgan 9241a29336
Revert "Revert "bump submodule to 6c00a06 (#2479)"" (#2485)
This reverts commit 6920964b87.
2024-02-13 18:18:41 -08:00
Jeffrey Morgan f7231ad9ad
set shutting_down to false once shutdown is complete (#2484) 2024-02-13 17:48:41 -08:00
Jeffrey Morgan 6920964b87 Revert "bump submodule to 6c00a06 (#2479)"
This reverts commit 2f9ed52bbd.
2024-02-13 17:23:05 -08:00
Jeffrey Morgan 2f9ed52bbd
bump submodule to 6c00a06 (#2479) 2024-02-13 17:12:42 -08:00
Daniel Hiltgen 939c60473f
Merge pull request #2422 from dhiltgen/better_kill
More robust shutdown
2024-02-12 14:05:06 -08:00
Jeffrey Morgan f76ca04f9e
update submodule to 099afc6 (#2468) 2024-02-12 14:01:16 -08:00
Daniel Hiltgen 76b8728f0c
Merge pull request #2465 from dhiltgen/block_rocm_pre_9
Detect AMD GPU info via sysfs and block old cards
2024-02-12 12:41:43 -08:00
Daniel Hiltgen 6d84f07505 Detect AMD GPU info via sysfs and block old cards
This wires up some new logic to start using sysfs to discover AMD GPU
information and detects old cards we can't yet support so we can fallback to CPU mode.
2024-02-12 08:19:41 -08:00
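A hedged sketch of sysfs-based discovery: walk the DRM class entries, keep devices whose PCI vendor is AMD (0x1002), and let the caller decide whether each card is new enough to support. The sysfs paths are standard Linux; the function name and the absence of a gfx-version cutoff check are simplifications:

```go
package gpu

import (
	"os"
	"path/filepath"
	"strings"
)

// amdGPUsFromSysfs returns the sysfs device paths of AMD GPUs. Connector
// entries (e.g. card0-DP-1) are skipped naturally because they have no
// device/vendor file for Glob to match.
func amdGPUsFromSysfs() []string {
	var gpus []string
	vendorFiles, _ := filepath.Glob("/sys/class/drm/card[0-9]*/device/vendor")
	for _, vf := range vendorFiles {
		raw, err := os.ReadFile(vf)
		if err != nil {
			continue
		}
		if strings.TrimSpace(string(raw)) == "0x1002" { // AMD's PCI vendor ID
			gpus = append(gpus, filepath.Dir(vf))
		}
	}
	return gpus
}
```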
Jeffrey Morgan 26b13fc33c
patch: always add token to cache_tokens (#2459) 2024-02-12 08:10:16 -08:00