ollama/llm
Commit 12e8c12d2b by Jongwook Choi:
Disable CUDA peer access as a workaround for multi-gpu inference bug (#1261)
When CUDA peer access is enabled, multi-GPU inference produces garbage
output. This is a known bug in llama.cpp (or the NVIDIA driver). Until the
upstream bug is fixed, we disable CUDA peer access temporarily to ensure
correct output.

See #961.
Committed 2023-11-24 14:05:57 -05:00
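
As context for the workaround, here is a minimal sketch of what leaving CUDA peer access disabled looks like at the runtime-API level. This is not the actual llama.cpp patch; the loop and messages are illustrative only. The point is simply that cudaDeviceEnablePeerAccess is never called, so cross-GPU transfers fall back to the non-peer copy path.

```cpp
// Minimal standalone sketch (not the actual llama.cpp patch): with the
// workaround, cudaDeviceEnablePeerAccess is simply never called, so
// cross-GPU transfers take the non-peer path instead of direct P2P copies.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count < 2) {
        std::printf("fewer than two CUDA devices; peer access is not relevant\n");
        return 0;
    }
    for (int src = 0; src < device_count; ++src) {
        for (int dst = 0; dst < device_count; ++dst) {
            if (src == dst) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, src, dst);
            // Report whether P2P would be possible, but leave it disabled:
            // no cudaSetDevice(src) + cudaDeviceEnablePeerAccess(dst, 0) here.
            std::printf("GPU %d -> GPU %d: peer access %s, left disabled\n",
                        src, dst, can_access ? "possible" : "unsupported");
        }
    }
    return 0;
}
```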
File          Committed                   Last commit message
llama.cpp     2023-11-24 14:05:57 -05:00  Disable CUDA peer access as a workaround for multi-gpu inference bug (#1261)
falcon.go     2023-10-02 19:56:51 -07:00  starcoder
ggml.go       2023-10-23 09:35:49 -07:00  ggufv3
gguf.go       2023-11-22 11:40:30 -08:00  fix: gguf int type
llama.go      2023-11-20 19:54:04 -05:00  only set main_gpu if value > 0 is provided
llm.go        2023-11-20 13:44:31 -08:00  recent llama.cpp update added kernels for fp32, q5_0, and q5_1
starcoder.go  2023-10-02 19:56:51 -07:00  starcoder
utils.go      2023-08-10 09:23:10 -07:00  partial decode ggml bin for more info