Commit graph

47 commits

Author SHA1 Message Date
Michael Yang 6ffb5cb017 add conversion for microsoft phi 3 mini/medium 4k, 128k 2024-08-12 15:13:29 -07:00
Michael Yang df993fa37b comments 2024-07-31 15:58:55 -07:00
Michael Yang 5e9db9fb0b refactor convert 2024-07-31 15:58:33 -07:00
Michael Yang 6b252918fb update convert test to check result data 2024-07-31 10:59:38 -07:00
Michael Yang 4a565cbf94 add chat and generate tests with mock runner 2024-07-16 09:39:31 -07:00
Blake Mizerany cb42e607c5
llm: speed up gguf decoding by a lot (#5246)
Previously, some costly things were causing the loading of GGUF files
and their metadata and tensor information to be VERY slow:

  * Too many allocations when decoding strings
  * Hitting disk for each read of each key and value, resulting in a
    not-okay amount of syscalls/disk I/O.

The show API is now down to 33ms from 800ms+ for llama3 on a macbook pro
m3.

This commit also prevents collecting large arrays of values when
decoding GGUFs (if desired). When such keys are encountered, their
values are null, and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.
2024-06-24 21:47:52 -07:00
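A minimal sketch of the buffering half of that fix, assuming hypothetical names rather than the actual ollama decoder: put a bufio.Reader in front of the file so the many small key/value reads are served from memory instead of costing a syscall each.

```go
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

// readString decodes one GGUF-style length-prefixed string. Behind a
// bufio.Reader, the many small reads for keys and values hit the
// in-memory buffer instead of issuing one syscall each.
func readString(r io.Reader) (string, error) {
	var n uint64
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
		return "", err
	}
	buf := make([]byte, n)
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	f, err := os.Open("model.gguf")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer f.Close()
	br := bufio.NewReaderSize(f, 1<<20) // one large read replaces many tiny ones
	fmt.Println(readString(br))
}
```

The allocation half of the fix would presumably follow the same pattern, reusing a scratch buffer across string reads rather than allocating per key.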
Michael Yang 7bdcd1da94 Revert "Merge pull request #4938 from ollama/mxyng/fix-byte-order"
This reverts commit f5f245cc15, reversing
changes made to 94d37fdcae.

this change broke gguf v2, which was incorrectly detected as big endian
2024-06-11 15:56:17 -07:00
Michael Yang 620d5c569e fix parsing big endian gguf 2024-06-08 12:35:26 -07:00
Michael Yang 030e765e76 fix create model when template detection errors 2024-06-07 10:51:35 -07:00
Michael Yang e40145a39d lint 2024-06-04 11:13:30 -07:00
Michael Yang 171eb040fc simplify safetensors reading 2024-05-21 11:28:22 -07:00
Michael Yang bbbd9f20f3 cleanup 2024-05-20 16:13:57 -07:00
Michael Yang 547132e820 bpe pretokenizer 2024-05-20 16:13:57 -07:00
Patrick Devine c8cf0d94ed llama3 conversion 2024-05-20 16:13:57 -07:00
Patrick Devine 14476d48cc
fixes for gguf (#3863) 2024-04-23 20:57:20 -07:00
Michael Yang e74163af4c fix padding to only return padding 2024-04-16 15:43:26 -07:00
Michael Yang 6d53b67c2c
Merge pull request #3663 from ollama/mxyng/fix-padding 2024-04-15 17:44:54 -07:00
Michael Yang 969238b19e fix padding in decode
TODO: update padding() to _only_ return the padding
2024-04-15 17:27:06 -07:00
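Both padding commits draw the same distinction: the helper should return only the pad amount, not the padded offset. A minimal sketch under that reading (illustrative, not ollama's code):

```go
// padding returns only the number of bytes needed to advance offset to
// the next multiple of align, not offset plus that amount.
func padding(offset, align int64) int64 {
	return (align - offset%align) % align
}
```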
Patrick Devine 9f8691c6c8
Add llama2 / torch models for ollama create (#3607) 2024-04-15 11:26:42 -07:00
Michael Yang 8b2c10061c refactor tensor query 2024-04-10 11:37:20 -07:00
Michael Yang d338d70492 refactor model parsing 2024-04-01 13:16:15 -07:00
Patrick Devine 5a5efee46b
Add gemma safetensors conversion (#3250)
Co-authored-by: Michael Yang <mxyng@pm.me>
2024-03-28 18:54:01 -07:00
Patrick Devine 1b272d5bcd
change github.com/jmorganca/ollama to github.com/ollama/ollama (#3347) 2024-03-26 13:04:17 -07:00
Michael Yang 22f326464e
Merge pull request #3083 from ollama/mxyng/refactor-readseeker
refactor readseeker
2024-03-16 12:08:56 -07:00
Blake Mizerany 6ce37e4d96
llm,readline: use errors.Is instead of simple == check (#3161)
This fixes some brittle, simple equality checks to use errors.Is. Since
go1.13, errors.Is is the idiomatic way to check for errors.

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
2024-03-15 07:14:12 -07:00
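The motivation in miniature: a plain == check misses a sentinel error as soon as any callee wraps it, while errors.Is walks the wrap chain. A self-contained illustration:

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// read wraps a sentinel error, as callees commonly do since go1.13.
func read() error {
	return fmt.Errorf("reading input: %w", io.EOF)
}

func main() {
	err := read()
	fmt.Println(err == io.EOF)          // false: == misses the wrapped sentinel
	fmt.Println(errors.Is(err, io.EOF)) // true: errors.Is walks the wrap chain
}
```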
Michael Yang 0085297928 refactor readseeker 2024-03-12 12:54:18 -07:00
Michael Yang 76bdebbadf decode ggla 2024-03-08 15:46:25 -08:00
Patrick Devine 2c017ca441
Convert Safetensors to an Ollama model (#2824) 2024-03-06 21:01:51 -08:00
Michael Yang 949d7b1c48
add gguf file types (#2532) 2024-02-20 19:06:29 -05:00
Michael Yang cd22855ef8 refactor tensor read 2024-01-24 10:48:31 -08:00
Michael Yang eaed6f8c45 add max context length check 2024-01-12 14:54:07 -08:00
Jeffrey Morgan 08f1e18965
Offload layers to GPU based on new model size estimates (#1850)
* select layers based on estimated model memory usage

* always account for scratch vram

* don't load +1 layers

* better estimation for graph alloc

* Update gpu/gpu_darwin.go

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

* Update llm/llm.go

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

* Update llm/llm.go

* add overhead for cuda memory

* Update llm/llm.go

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

* fix build error on linux

* address comments

---------

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2024-01-08 16:42:00 -05:00
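A hedged sketch of the estimate-then-offload logic those bullets describe; the names and the flat per-layer size are illustrative assumptions, not ollama's actual accounting:

```go
// layersToOffload estimates how many layers fit in free VRAM, always
// reserving scratch space and a fixed overhead first, and never
// reporting more layers than the model has.
func layersToOffload(freeVRAM, layerSize, scratch, overhead int64, totalLayers int) int {
	usable := freeVRAM - scratch - overhead
	if usable <= 0 || layerSize <= 0 {
		return 0
	}
	n := int(usable / layerSize)
	if n > totalLayers {
		n = totalLayers // don't load +1 layers
	}
	return n
}
```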
Michael Yang 56ffc3023a remove per-model types
mostly replaced by decoding tensors, except ggml models, which only
support llama
2023-12-11 09:40:21 -08:00
Michael Yang 5a5dca13b2 comments 2023-12-04 16:59:23 -08:00
Michael Yang 72e7a49aa9 seek instead of copyn 2023-12-04 16:59:23 -08:00
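"seek instead of copyn" in a nutshell: skipping data by seeking is one cheap operation, whereas io.CopyN(io.Discard, r, n) reads and discards every byte. A minimal sketch with a hypothetical helper (assuming the io import):

```go
// skip advances past n bytes without reading them.
func skip(rs io.ReadSeeker, n int64) error {
	_, err := rs.Seek(n, io.SeekCurrent)
	return err
	// previously: _, err = io.CopyN(io.Discard, rs, n)
}
```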
Michael Yang 2cb0fa7d40 split FROM into one or more models 2023-12-04 16:59:23 -08:00
Michael Yang 199941cd15 fix: gguf int type 2023-11-22 11:40:30 -08:00
Michael Yang c5e1bbabda
instead of a static number of parameters for each model family, get the real number from the tensors (#1022)
* parse tensor info

* refactor decoder

* return actual parameter count

* explicit rounding

* s/Human/HumanNumber/
2023-11-08 17:55:46 -08:00
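The "real number from the tensors" is plain arithmetic over the parsed tensor info: sum, over all tensors, the product of each tensor's dimensions. A sketch with an illustrative Tensor type, not ollama's:

```go
// Tensor is an illustrative stand-in for parsed GGUF tensor info.
type Tensor struct {
	Name string
	Dims []uint64
}

// parameterCount sums, over every tensor, the product of its dimensions.
func parameterCount(tensors []Tensor) uint64 {
	var total uint64
	for _, t := range tensors {
		n := uint64(1)
		for _, d := range t.Dims {
			n *= d
		}
		total += n
	}
	return total
}
```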
Michael Yang 125d0a013a ggufv3
ggufv3 adds support for big endianness, mainly for the s390x architecture.
while that's not currently supported in ollama, the change is simple.

loosen the version check to be more forward compatible. unless specified,
gguf versions other than v1 will be decoded as v2.
2023-10-23 09:35:49 -07:00
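A minimal sketch of the loosened version check; decodeV1 and decodeV2 are hypothetical decoders (assuming the io import), not the actual code. v1 keeps its own path, and every other version, including v3 and unknown future ones, decodes as v2:

```go
// decode dispatches on the GGUF version field; decodeV1 and decodeV2
// are hypothetical decoders.
func decode(version uint32, rs io.ReadSeeker) (map[string]any, error) {
	switch version {
	case 1:
		return decodeV1(rs) // v1 keeps its own decoder
	default:
		// forward compatible: v2, v3, and anything newer take the v2 path
		return decodeV2(rs)
	}
}
```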
Michael Yang c02c0cd483 starcoder 2023-10-02 19:56:51 -07:00
Bruce MacDonald 86279f4ae3
unbound max num gpu layers (#591)
---------

Co-authored-by: Michael Yang <mxyng@pm.me>
2023-09-25 18:36:46 -04:00
Bruce MacDonald 4cba75efc5
remove tmp directories created by previous servers (#559)
* remove tmp directories created by previous servers

* clean up on server stop

* Update routes.go

* Update server/routes.go

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>

* create top-level temp ollama dir

* check file exists before creating

---------

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
2023-09-21 20:38:49 +01:00
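A sketch of the scheme those bullets describe, with illustrative paths and names: keep every server's payload under one well-known parent, sweep stale directories from previous servers at startup, then create a fresh one.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// runnersDir sweeps temp directories left behind by previous servers,
// then creates a fresh one for this server under a single parent.
func runnersDir() (string, error) {
	parent := filepath.Join(os.TempDir(), "ollama")
	os.RemoveAll(parent) // remove tmp directories created by previous servers
	if err := os.MkdirAll(parent, 0o755); err != nil {
		return "", err
	}
	return os.MkdirTemp(parent, "runners") // this server's own dir
}

func main() {
	fmt.Println(runnersDir())
}
```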
Bruce MacDonald 66003e1d05
subprocess improvements (#524)
* subprocess improvements

- increase start-up timeout
- when the runner fails to start, fail rather than timing out
- try runners in order rather than choosing 1 runner
- embed metal runner in metal dir rather than gpu
- refactor logging and error messages

* Update llama.go

* Update llama.go

* simplify by using glob
2023-09-18 15:16:32 -04:00
Bruce MacDonald 2540c9181c
support for packaging in multiple cuda runners (#509)
* enable packaging multiple cuda versions
* use nvcc cuda version if available

---------

Co-authored-by: Michael Yang <mxyng@pm.me>
2023-09-14 15:08:13 -04:00
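A sketch of "use nvcc cuda version if available"; the parsing is an illustrative assumption. nvcc --version reports a line like "Cuda compilation tools, release 11.8, ...", so the release number can be extracted and matched against the packaged runners:

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
)

// cudaVersion returns the toolkit release reported by nvcc, or "" when
// nvcc is absent so the caller can fall back to a default runner.
func cudaVersion() string {
	out, err := exec.Command("nvcc", "--version").Output()
	if err != nil {
		return ""
	}
	m := regexp.MustCompile(`release (\d+\.\d+)`).FindSubmatch(out)
	if m == nil {
		return ""
	}
	return string(m[1])
}

func main() {
	fmt.Println(cudaVersion())
}
```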
Michael Yang 0c5a454361 fix model type for 70b 2023-09-12 15:12:59 -07:00
Michael Yang 7dee25a07f fix falcon decode
get model and file type from bin file
2023-09-12 12:34:53 -07:00
Bruce MacDonald 09dd2aeff9
GGUF support (#441) 2023-09-07 13:55:37 -04:00