Commit graph

62 commits

Author SHA1 Message Date
Daniel Hiltgen 0498f7ce56 Get rid of one-line llama.log
This log line was causing a one-line llama.log file to be created
in the working directory of the server.
2024-01-02 15:36:16 -08:00
Daniel Hiltgen 738a8d12eb Rename the ollama cmakefile 2024-01-02 15:36:16 -08:00
Daniel Hiltgen d966b730ac Switch windows build to fully dynamic
Refactor where we store build outputs, and support a fully dynamic loading
model on Windows so the base executable has no special dependencies and thus
doesn't require a special PATH.
2024-01-02 15:36:16 -08:00
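On Windows, "fully dynamic" here means the base executable probes for optional DLLs at runtime rather than linking them, so a missing dependency doesn't stop it from starting. A small Windows-only sketch of that idea (package and function names are hypothetical, not ollama's actual code):

```go
//go:build windows

package gpu

import "syscall"

// loadOptionalDLL probes for an optional accelerator DLL at runtime.
// If the DLL or one of its dependencies is missing, the caller falls
// back to the CPU path instead of the whole executable failing to start.
func loadOptionalDLL(path string) (*syscall.DLL, bool) {
	dll, err := syscall.LoadDLL(path)
	if err != nil {
		return nil, false // dependency not present; keep running without it
	}
	return dll, true
}
```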
Daniel Hiltgen 9a70aecccb Refactor how we augment llama.cpp
This changes how llama.cpp is included: instead of applying a patch, the C++
code now lives directly in the ollama tree, which should make it easier to
refine and update over time.
2024-01-02 15:35:55 -08:00
Daniel Hiltgen e5202eb687 Quiet down llama.cpp logging by default
By default builds will now produce non-debug and non-verbose binaries.
To enable verbose logs in llama.cpp and debug symbols in the
native code, set `CGO_CFLAGS=-g`
2023-12-22 08:47:18 -08:00
Daniel Hiltgen fa24e73b82 Remove CPU build, fixup linux build script 2023-12-21 18:21:31 -08:00
Daniel Hiltgen 325d74985b Fix CPU performance on hyperthreaded systems
The default thread count logic was broken and spawned twice as many threads
as it should on a hyperthreaded CPU, resulting in thrashing and poor performance.
2023-12-21 16:23:36 -08:00
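The fix described above amounts to counting physical cores instead of logical (hyperthreaded) CPUs. A minimal, Linux-only sketch of one way to do that in Go, assuming `/proc/cpuinfo` is available (illustrative only, not the actual ollama logic):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strings"
)

// physicalCores counts unique (physical id, core id) pairs in /proc/cpuinfo,
// falling back to the logical CPU count if the file can't be parsed.
func physicalCores() int {
	f, err := os.Open("/proc/cpuinfo")
	if err != nil {
		return runtime.NumCPU()
	}
	defer f.Close()

	cores := map[string]bool{}
	var physID string
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		k, v, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue
		}
		switch strings.TrimSpace(k) {
		case "physical id":
			physID = strings.TrimSpace(v)
		case "core id":
			cores[physID+"/"+strings.TrimSpace(v)] = true
		}
	}
	if len(cores) == 0 {
		return runtime.NumCPU()
	}
	return len(cores)
}

func main() {
	fmt.Printf("logical: %d, physical: %d\n", runtime.NumCPU(), physicalCores())
}
```

Pinning the default thread count to physical cores avoids scheduling two compute-bound threads onto the same core.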
Daniel Hiltgen d9cd3d9667 Revive windows build
The Windows native setup still needs some more work, but this gets it building
again, and if you set the PATH properly you can run the resulting exe on a CUDA system.
2023-12-20 17:21:54 -08:00
Daniel Hiltgen 7555ea44f8 Revamp the dynamic library shim
This switches the default llama.cpp to be CPU based, and builds the GPU variants
as dynamically loaded libraries which we can select at runtime.

This also bumps the ROCm library to version 6, since 5.7 builds don't work
on the latest ROCm release that just shipped.
2023-12-20 14:45:57 -08:00
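The runtime selection this entry describes could be sketched roughly as follows; the library file names, directory layout, and detection flags are assumptions for illustration, not ollama's actual shim:

```go
package llm

import (
	"os"
	"path/filepath"
)

// pickVariant chooses which dynamically loaded llama.cpp build to use.
// CPU is the default; a GPU variant is only selected if its shared library
// was packaged and the corresponding hardware looks present.
func pickVariant(libDir string, hasCUDA, hasROCm bool) string {
	candidates := []struct {
		name string
		want bool
	}{
		{"libext_server_rocm.so", hasROCm}, // hypothetical file names
		{"libext_server_cuda.so", hasCUDA},
		{"libext_server_cpu.so", true}, // always-available fallback
	}
	for _, c := range candidates {
		p := filepath.Join(libDir, c.name)
		if _, err := os.Stat(p); c.want && err == nil {
			return p
		}
	}
	return ""
}
```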
Daniel Hiltgen 6558f94ed0 Fix darwin intel build 2023-12-19 13:32:24 -08:00
Daniel Hiltgen 1b991d0ba9 Refine build to support CPU only
If someone checks out the ollama repo and doesn't install the CUDA
library, this ensures they can still build a CPU-only version.
2023-12-19 09:05:46 -08:00
Daniel Hiltgen 9adca7f711 Bump llama.cpp to b1662 and set n_parallel=1 2023-12-19 09:05:46 -08:00
Daniel Hiltgen 89bbaafa64 Build linux using ubuntu 20.04
This changes the container-based linux build to use an older Ubuntu
distro to improve our compatibility matrix for older user machines
2023-12-19 09:05:46 -08:00
Daniel Hiltgen 35934b2e05 Adapted rocm support to cgo based llama.cpp 2023-12-19 09:05:46 -08:00
65a f8ef4439e9 Use build tags to generate accelerated binaries for CUDA and ROCm on Linux.
The build tags rocm or cuda must be specified to both go generate and go build.
ROCm builds need both ROCM_PATH set (and the ROCm SDK present) and CLBlast
installed (for GGML), with CLBlast_DIR set in the environment to the CLBlast
cmake directory (likely /usr/lib/cmake/CLBlast). Build tags are also used to
switch VRAM detection between the cuda and rocm implementations, using added
"accelerator_foo.go" files which contain architecture-specific functions and
variables. accelerator_none is used when no tags are set, and a helper
function addRunner will ignore it if it is the chosen accelerator. Fix go
generate commands; thanks @deadmeu for testing.
2023-12-19 09:05:46 -08:00
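The build-tag pattern this entry describes might look roughly like the two files sketched below; the function bodies are placeholders, and only the `//go:build` mechanics and the accelerator_*.go naming follow the commit message:

```go
// accelerator_cuda.go: compiled only when the cuda tag is set
//go:build cuda

package llm

// acceleratedVRAM would query the CUDA runtime for available VRAM.
func acceleratedVRAM() (uint64, error) {
	return 0, nil // placeholder
}
```

and the default used when no accelerator tag is given:

```go
// accelerator_none.go: compiled when neither cuda nor rocm is set
//go:build !cuda && !rocm

package llm

// With no accelerator selected, report no VRAM so the CPU path is used;
// a helper like addRunner can then skip this accelerator.
func acceleratedVRAM() (uint64, error) {
	return 0, nil
}
```

As the message notes, the tag has to reach both steps, e.g. `go generate -tags cuda ./...` followed by `go build -tags cuda .`.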
Daniel Hiltgen d4cd695759 Add cgo implementation for llama.cpp
Run the server.cpp directly inside the Go runtime via cgo
while retaining the LLM Go abstractions.
2023-12-19 09:05:46 -08:00
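The cgo pattern described here, running C/C++ inside the Go process while keeping a Go-facing API, looks roughly like this self-contained sketch; the `ext_server_start` function is a stand-in for the real llama.cpp server.cpp entry points:

```go
package main

/*
// Stand-in for the real llama.cpp server entry point that cgo would link in.
static int ext_server_start(int port) { return port > 0 ? 0 : -1; }
*/
import "C"

import "fmt"

// startServer keeps the Go-facing abstraction while the work happens in C.
func startServer(port int) error {
	if rc := C.ext_server_start(C.int(port)); rc != 0 {
		return fmt.Errorf("ext server failed to start (rc=%d)", rc)
	}
	return nil
}

func main() {
	if err := startServer(8080); err != nil {
		panic(err)
	}
	fmt.Println("server started")
}
```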
Bruce MacDonald 811b1f03c8 deprecate ggml
- remove ggml runner
- automatically pull gguf models when ggml detected
- tell users to update to gguf in case the automatic pull fails

Co-Authored-By: Jeffrey Morgan <jmorganca@gmail.com>
2023-12-19 09:05:46 -08:00
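Telling the two formats apart comes down to the file's magic bytes: GGUF files begin with the ASCII magic "GGUF". A rough sketch of that check (a simplification, not ollama's actual format detection):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// isGGUF reports whether the file starts with the GGUF magic bytes.
// Anything else is treated as a legacy ggml-family model here, which is
// a simplification of the real format checks.
func isGGUF(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	magic := make([]byte, 4)
	if _, err := io.ReadFull(f, magic); err != nil {
		return false, err
	}
	return bytes.Equal(magic, []byte("GGUF")), nil
}

func main() {
	gguf, err := isGGUF(os.Args[1])
	fmt.Println("gguf:", gguf, "err:", err)
}
```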
Jeffrey Morgan 6b5bdfa6c9 update runner submodule 2023-12-18 17:33:46 -05:00
Jeffrey Morgan c063ee4af0 update runner submodule to fix hipblas build 2023-12-18 15:41:13 -05:00
Jeffrey Morgan b85982eb91 update runner submodule 2023-12-18 12:43:31 -05:00
Jeffrey Morgan 31f0551dab Update runner to support mixtral and mixture of experts (MoE) (#1475) 2023-12-13 17:15:10 -05:00
Michael Yang 6deebf2489 update for qwen 2023-12-04 11:38:05 -08:00
Jeffrey Morgan 16a9006306 add back f16c instructions on intel mac 2023-11-26 15:59:49 -05:00
Jeffrey Morgan 9e4a316405 update submodule commit 2023-11-26 14:52:00 -05:00
Jing Zhang 82b9b329ff
windows CUDA support (#1262)
* Support cuda build in Windows
* Enable dynamic NumGPU allocation for Windows
2023-11-24 17:16:36 -05:00
Jongwook Choi 12e8c12d2b
Disable CUDA peer access as a workaround for multi-gpu inference bug (#1261)
When CUDA peer access is enabled, multi-GPU inference produces
garbage output. This is a known bug in llama.cpp (or NVIDIA). Until the
upstream bug is fixed, we can disable CUDA peer access temporarily
to ensure correct output.

See #961.
2023-11-24 14:05:57 -05:00
Jeffrey Morgan d77dde126b consistent cpu instructions on macos and linux 2023-11-22 16:26:46 -05:00
Michael Yang a00fac4ec8 update llama.cpp 2023-11-21 09:50:02 -08:00
Jeffrey Morgan 13ba6df5ab enable cpu instructions on intel macs 2023-11-19 23:20:26 -05:00
Jeffrey Morgan 41434a7cdc build intel mac with correct binary and compile flags 2023-11-16 22:14:51 -05:00
Jeffrey Morgan 3a1ed9ff70 restore building runner with AVX on by default (#900) 2023-10-27 12:13:44 -07:00
Jeffrey Morgan b0c9cd0f3b fix metal assertion errors 2023-10-24 00:32:36 -07:00
Jeffrey Morgan 77f61c6301 update submodule commit 2023-10-24 00:30:27 -07:00
Jeffrey Morgan f3604534e5 update submodule commit 2023-10-23 23:59:12 -07:00
Michael Yang 0c7a00a264 bump submodules
pin to 9e70cc03229df19ca2d28ce23cc817198f897278 for now since
438c2ca83045a00ef244093d27e9ed41a8cb4ea9 is breaking
2023-10-23 11:17:59 -07:00
Michael Yang c9167494cb update default log target 2023-10-23 10:44:50 -07:00
Bruce MacDonald f3648fd206 Update llama.cpp gguf to latest (#710) 2023-10-17 16:55:16 -04:00
Jeffrey Morgan ab0668293c llm: fix build on amd64 2023-10-06 14:39:54 -07:00
Bruce MacDonald 5d22319a2c
rename server subprocess (#700)
- this makes it easier to see that the subprocess is associated with ollama
2023-10-06 10:15:42 -04:00
Michael Yang 058d0cd04b silence warm up log 2023-09-21 14:53:33 -07:00
Michael Yang ee1c994d15 update submodule (#567) 2023-09-21 16:22:23 -04:00
Michael Yang a9ed7cc6aa rename generate.go 2023-09-20 14:42:17 -07:00
Michael Yang 6c6a31a1e8 embed libraries using cmake 2023-09-20 14:41:57 -07:00
Bruce MacDonald fc6ec356fc remove libcuda.so 2023-09-20 20:36:14 +01:00
Bruce MacDonald 1255bc9b45 only package 11.8 runner 2023-09-20 20:00:41 +01:00
Bruce MacDonald b9bb5ca288 use cuda_version 2023-09-20 17:58:16 +01:00
Bruce MacDonald 4e8be787c7 pack in cuda libs 2023-09-20 17:40:42 +01:00
Bruce MacDonald 66003e1d05
subprocess improvements (#524)
* subprocess improvements

- increase start-up timeout
- when a runner fails to start, fail rather than timing out
- try runners in order rather than choosing a single runner
- embed metal runner in metal dir rather than gpu
- refactor logging and error messages

* Update llama.go

* Update llama.go

* simplify by using glob
2023-09-18 15:16:32 -04:00
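The "try runners in order" and "simplify by using glob" points could look something like the sketch below; the glob pattern and the injected start function are hypothetical:

```go
package llm

import (
	"errors"
	"fmt"
	"path/filepath"
)

// startFirstWorkingRunner globs the packaged runner binaries and tries them
// in order, failing fast with the collected errors instead of timing out.
func startFirstWorkingRunner(runnerDir string, start func(path string) error) (string, error) {
	paths, err := filepath.Glob(filepath.Join(runnerDir, "*", "ollama-runner*"))
	if err != nil {
		return "", err
	}
	if len(paths) == 0 {
		return "", fmt.Errorf("no runners found under %s", runnerDir)
	}
	var errs []error
	for _, p := range paths {
		if err := start(p); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", p, err))
			continue
		}
		return p, nil // first runner that starts wins
	}
	return "", errors.Join(errs...)
}
```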
Bruce MacDonald 2540c9181c
support for packaging in multiple cuda runners (#509)
* enable packaging multiple cuda versions
* use nvcc cuda version if available

---------

Co-authored-by: Michael Yang <mxyng@pm.me>
2023-09-14 15:08:13 -04:00
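"Use nvcc cuda version if available" suggests shelling out to `nvcc --version` and parsing the release number; a hedged sketch of that idea (the fallback value is an assumption):

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
)

// nvccCudaVersion returns the CUDA toolkit version reported by nvcc,
// e.g. "11.8", or a fallback if nvcc is missing or its output is unparsable.
func nvccCudaVersion(fallback string) string {
	out, err := exec.Command("nvcc", "--version").Output()
	if err != nil {
		return fallback
	}
	// nvcc prints a line like: "Cuda compilation tools, release 11.8, V11.8.89"
	m := regexp.MustCompile(`release (\d+\.\d+)`).FindSubmatch(out)
	if m == nil {
		return fallback
	}
	return string(m[1])
}

func main() {
	fmt.Println("packaging runner for CUDA", nvccCudaVersion("11.8"))
}
```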
Bruce MacDonald f59c4d03f7 fix ggml arm64 cuda build (#520) 2023-09-12 17:06:48 -04:00