I wonder if Ollama has, or plans to have, "Supported backends" other than llama.cpp. It's listed on the very last line of their README, as if the llama.cpp dependency were an incidental, minor detail rather than Ollama being a deployment mechanism for llama.cpp and GGUF-based models.
Yes, we are also looking at integrating MLX [1], which is optimized for Apple Silicon and built by an amazing team of individuals, a few of whom were behind the original Torch [2] project. There's also TensorRT-LLM [3] by Nvidia, optimized for their recent hardware.
All of this of course acknowledging that llama.cpp is an incredible project with competitive performance and support for almost any platform.
I don't think they will move away from llama.cpp until they are forced to. The number of people contributing to llama.cpp is quite significant [1] and it wouldn't make sense to use another backend given how quickly llama.cpp is iterating and growing.
The chance onnx becomes significantly relevant here went from 1% to 15% this week. They're demo'ing ~2x faster inference with Phi-3. There's been fits and starts on LLMs in ONNX for a year, but, with Wintel's AI PC™ push, and all the constituent parts in place (4 bit quants! adaptive quants!), I'd put very good money on it.
I know it is a Microsoft Project. My reasoning is, if Ollama supports ONNX and if it can provide performance on par or better than llama.cpp, it would make sense for Microsoft to acquire Ollama for distribution reasons.
Llama.cpp is the valuable bit here, and Ollama is only good for end user convenience. It saves you 20 minutes of googling and futzing with the million and one llama.cpp wrappers available for every language, and lets you set things up to load on startup, but if you're building something for scale or backend, neither llama.cpp nor Ollama is coming along for the ride. At best it'll live through a proof of concept stage, but as soon as you start caring about performance it's getting discarded.
Microsoft isn't going to pay for something that amounts to a useful setup script wrapped around an inefficient convenience library intended for people to be able to run AI on consumer hardware. There's no exploitable value proposition, whereas building their own closed source AI systems that are tightly coupled to the Windows ecosystem and favor cloud services allows them to extract maximum rent.
Their behaviour around llama.cpp acknowledgement is very shady. Until very recently, there was no mention of llama.cpp in their README at all, and now it's tucked away all the way at the bottom. Compare that to the originally proposed PR, for example: https://github.com/ollama/ollama/pull/3700
Do you know maybe what are these alternative engines they're talking about? Or is it just a way to evade the fact that at the end of the day it is just a wrapper around llama.cpp?
It was mentioned in another comment to the parent. There are no alternatives currently, the whole thing has been built upon llama.cpp since its inception.
Ollama is great. I actually wish they would wrap OpenAI and Azure and generally act as a proxy for third party APIs. Having a consistent, well thought out API which isn't tied to a single provider would be really good for the community.
Edit: this would be useful because in many cases some workloads can be local, but others cannot... e.g. if you really need gpt4 for specific queries.
Ollama is simply great! I was quite surprised how easy it is to integrate through their API. A simple chat using Ollama + llama3 is less than 40 lines of TypeScript: https://github.com/wiktor-k/llama-chat
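For anyone who wants to see roughly what that integration looks like without opening the repo, here is a minimal Python sketch. It is not the code from the linked TypeScript project. It assumes Ollama is listening on its default local address (http://localhost:11434), that the `llama3` model has already been pulled, and that the non-streaming `/api/chat` response carries the reply under `message.content`. The helper names `build_chat_request` and `chat` are made up for this example.

```python
import json
import urllib.request

# Assumption: Ollama's default local address; change it if your server differs.
OLLAMA_URL = "http://localhost:11434"


def build_chat_request(model, messages, base_url=OLLAMA_URL):
    """Build a POST request for Ollama's /api/chat endpoint (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": messages,  # list of {"role": ..., "content": ...} dicts
        "stream": False,       # ask for a single JSON reply instead of a stream
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, messages, base_url=OLLAMA_URL):
    """Send the conversation so far and return the assistant's reply text."""
    request = build_chat_request(model, messages, base_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["message"]["content"]
```

With a server running, `chat("llama3", [{"role": "user", "content": "Why is the sky blue?"}])` should return the reply as a string; appending that reply and the next user message to the list is all it takes to keep a conversation going.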
I actually just benchmarked Llama3 70B coding with aider, and it did quite well. It scored similar to GPT 3.5.
You can use Llama3 70B with aider via Ollama [0]. It's also available for free via Groq [1] (with rate limits). And OpenRouter has it available [2] for low cost on their paid api.
Paul's benchmarks are excellent and they're the first thing I look for to get a sense of a new model's performance :)
For those looking to create their own benchmarks, promptfoo[0] is one way to do this locally:
    prompts:
      - "Write this in Python 3: {{ask}}"
    providers:
      - ollama:chat:llama3:8b
      - ollama:chat:phi3
      - ollama:chat:qwen:7b
    tests:
      - vars:
          ask: a function to determine if a number is prime
      - vars:
          ask: a function to split a restaurant bill given individual contributions and shared items
Jumping in because I'm a big believer in (1) local LLMs, and (2) evals specific to individual use cases.
Nice, I've been looking out for something like this! What's Jina AI and how is it local if I need an API key for it? Also, this is the first time I'm hearing about poetry. Might be worth including in the prerequisites (unless I can just stick with pip?)
> What's Jina AI and how is it local if I need an API key for it?
Looks like it's a vector DB used for creating and looking up embeddings (vectors). LLM is the second part of RAG, the first part is having a good embedding model.
Oh yes good point, it's because at first I was using a local embedding model and then decided to use JinaAI as they have good embeddings.
Jina AI does several different things, but one of them is embeddings, and I use their English/German embeddings because in one demo I am working with German data.
You can use pip as well but yes, let me add something about Poetry in case people don't know about it :)
They do, and I was using the "new" models before the update. Perhaps there is tuning or bug fixes for them? Or they just want to confirm that these are supported. There are some new models that do have different architectures, so sometimes an update is necessary.
Phi 3 has a unique architecture that needed some additions to llama.cpp's conversion script. Also Phi 3 is an absolute mess, there's no reliable way to latch on to when it's done writing a message and no one wants to admit it, people are patching around it instead.
ex. I could condition on "\n\n<|assistant|>||<|system|>||<|user|>", but it'd still be wrong.
Pretty much everything about Phi 3 feels like it had to come out within 48 hours, a month too early. The ONNX genai library doesn't work on Mac, at all, the mobile SDKs don't support it...sigh
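To make the "patching around it" concrete, here is a rough Python sketch of that kind of workaround: cut the generated text at the first role marker that leaks into the output. The function name `truncate_at_stop` is made up for this example, the marker list just mirrors the strings mentioned above, and, as noted, this is a patch around the problem rather than a reliable end-of-message signal.

```python
# Markers taken from the comment above; treating them as stop strings is a
# workaround, not something the model guarantees.
STOP_MARKERS = ["\n\n<|assistant|>", "<|system|>", "<|user|>"]


def truncate_at_stop(text, stop_markers=STOP_MARKERS):
    """Return text cut at the earliest occurrence of any stop marker."""
    cut = len(text)
    for marker in stop_markers:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)  # keep the earliest match across all markers
    return text[:cut]
```

Output that contains none of the markers comes back unchanged, which is exactly the failure case being complained about: if the model never emits a marker, nothing tells you the message is done.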
Because quantization takes time to get bug-free when new architectures are released. If a model was quantized with a known bug in the quantizer, those quantized versions are effectively buggy, and they need to be requantized with a newer version of llama.cpp that has the fix.
This looks very awesome. Can someone tell me why there is no chatter about this? Is there something else out there that blows this out of the water in terms of ease of use and access to sample many LLMs?
Ollama provides a web server with API that just works out of the box, which is great when you want to integrate multiple applications (potentially distributed on smaller edge devices) with LLMs that run on a single beefy machine.
In my home I have a large gaming rig that sometimes runs Ollama+Open WebUI, then I also have a bunch of other services running on a smaller server and a Raspberry Pi which reach out to Ollama for their LLM inference needs.
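A sketch of what one of those small client services might do, in Python. Everything specific here is an assumption: the hostname `gaming-rig.local` is hypothetical, the port is Ollama's default 11434, and the server is assumed to have been configured to accept non-local connections (by default it only listens on localhost; setting `OLLAMA_HOST` on the server side is the usual way to change that). The helper names are made up for this example.

```python
import json
import urllib.request


def build_generate_request(prompt, model, base_url):
    """Build a POST request for /api/generate on a (possibly remote) Ollama host."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3", base_url="http://gaming-rig.local:11434"):
    """Run one completion on the remote machine and return the generated text."""
    request = build_generate_request(prompt, model, base_url)
    # Generous timeout: the big machine may need to load the model first.
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)["response"]
```

The client side needs nothing beyond the standard library, which is part of why this setup works well on something as small as a Raspberry Pi.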
Are you talking about the Hugging Face Python libraries, the Hugging Face hosted inference APIs, the Hugging Face web interfaces, the Hugging Face iPhone app, Hugging Face Spaces (hosted Docker environments with GPU access) or something else?
But it's likely to be much slower than what you'd get with a backend like llama.cpp on CPU (particularly if you're running on a Mac, but I think on Linux as well), as well as not supporting features like CPU offloading.
I think the biggest selling point of ollama (llama.cpp) is quantization: for a slight hit in quality (with q8 or q4) you can get a significant performance boost.
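For intuition about where the "slight hit in quality" comes from, here is a toy Python sketch of symmetric 8-bit quantization of one block of weights with a single shared scale. This is an illustration of the general idea only, not llama.cpp's actual code or file format, and the function names are made up for this example.

```python
def quantize_block(values):
    """Map a block of floats to integer codes in [-127, 127] plus one scale."""
    largest = max(abs(v) for v in values)
    scale = largest / 127 if largest > 0 else 1.0  # avoid dividing by zero
    codes = [round(v / scale) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the integer codes."""
    return [scale * c for c in codes]


def max_roundtrip_error(values):
    """Largest absolute difference between original and dequantized values."""
    scale, codes = quantize_block(values)
    restored = dequantize_block(scale, codes)
    return max(abs(a - b) for a, b in zip(values, restored))
```

Each weight ends up at most half a quantization step (`scale / 2`) away from its original value, which is why 8-bit output stays close to full precision; with fewer bits the steps get coarser and the error grows.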
Does ollama/llama.cpp provide low bit operations (avx or cuda kernels) to speed up inference? Or just model compression with inference still done in fp16?
My understanding is the modern quantization algorithms are typically implemented in Pytorch.
The only thing I know (from using it) is that with quantization I can fit models like llama2 13b in my 24GB of VRAM when I use q8 (16GB) instead of fp16 (26GB). This means I can get nearly the full quality of llama2 13b's output while still being able to use only my GPU, without the need to do very slow inference on only CPU+RAM.
And the models are quantized before inference, so I'd only download 16GB for the llama2 13b q8 instead of the full 26GB, which means it's not done on the fly.
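The fp16 figure is easy to sanity-check with back-of-the-envelope arithmetic. The Python sketch below counts weights only, using decimal gigabytes; real downloads and VRAM use come out higher because quantized formats also store scales and the runtime needs room for the KV cache and other buffers, so treat it as a lower bound. The function name is made up for this example.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# A 13B-parameter model: 2 bytes per weight at fp16, 1 byte at a plain 8 bits.
fp16_gb = weight_size_gb(13e9, 16)
int8_gb = weight_size_gb(13e9, 8)
```

Here `fp16_gb` comes out to 26, matching the number quoted above, and `int8_gb` to 13, which is the floor that a q8 file sits somewhat above.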
As an aside, even gpt4 level quality does not feel satisfactory to me lately. I can’t imagine willingly using models as dumb as llama2-13b. What do you do with it?
Yeah I agree, every time a new model is released I download the highest quantization (or fp16) that fits into my VRAM, test it out with a few prompts, and then realize that downloadable models are still not as good as the closed ones (except speed-wise).
I don't know why I still do it, but every time I read so many comments about how good model X is and how it outperforms everything else, I want to see it for myself.
Ollama is really organized - it relies on llama but the UX and organization it provides makes it legit. We recently made a one-click wizard to run Open WebUI and Ollama together, self hosted and remotely accessible but locally hosted [1]
LM Studio is a lot more user friendly, probably the easiest UI to use out there. No terminal nonsense, no manual to read. Just double click and chat. It even explains to you what the model names mean (eg diff between Q4_1 Q4_K Q4_K_M... For whatever reason all the other tools assume you know what it means).
Why do you think there is no chatter about this? There have been hundreds of posts about ollama on HN. This is a point release of an already well known project.
I use a mix of llama.cpp directly via my own Python bindings and via llama-cpp-python for function calling and full control over parameters and loading, but otherwise ollama is just great for ease of use. There's really no reason not to use it if you just want to load gguf models and don't have any intricate requirements.