Ollama v0.1.33 with Llama 3, Phi 3, and Qwen 110B (github.com/ollama)
192 points by ashvardanian 15 days ago | 64 comments



I wonder if Ollama has, or plans to have, "supported backends" other than llama.cpp. It's listed on the very last line of their README as if the llama.cpp dependency were incidental and a very minor detail, rather than Ollama being a deployment mechanism for llama.cpp and GGUF-based models.


Yes, we are also looking at integrating MLX [1], which is optimized for Apple Silicon and built by an amazing team of individuals, a few of whom were behind the original Torch [2] project. There's also TensorRT-LLM [3] by Nvidia, optimized for their recent hardware.

All of this, of course, acknowledging that llama.cpp is an incredible project with competitive performance and support for almost any platform.

[1] https://github.com/ml-explore/mlx

[2] https://en.wikipedia.org/wiki/Torch_(machine_learning)

[3] https://github.com/NVIDIA/TensorRT-LLM


MLX and TensorRT would be really nice!


I don't think they will move away from llama.cpp until they are forced to. The number of people contributing to llama.cpp is quite significant [1] and it wouldn't make sense to use another backend given how quickly llama.cpp is iterating and growing.

[1] https://devboard.gitsense.com/ggerganov?r=ggerganov%2Fllama....

Full disclosure: This is my tool


ghost of christmas future

The chance ONNX becomes significantly relevant here went from 1% to 15% this week. They're demoing ~2x faster inference with Phi-3. There have been fits and starts on LLMs in ONNX for a year, but with Wintel's AI PC™ push and all the constituent parts in place (4-bit quants! adaptive quants!), I'd put very good money on it.


So you're saying Ollama would be a strong MS acquisition in the future if ONNX works out.


No, ONNX is a Microsoft project. I don't know why people know what Ollama is, and I don't think they will in a year.


I know it is a Microsoft project. My reasoning is: if Ollama supports ONNX and it can provide performance on par with or better than llama.cpp, it would make sense for Microsoft to acquire Ollama for distribution reasons.


Llama.cpp is the valuable bit here; Ollama is only good for end-user convenience. It saves you 20 minutes of googling and futzing with the million and one llama.cpp wrappers available for every language, and lets you set things up to load on startup, but if you're building something for scale or backend, neither llama.cpp nor Ollama is coming along for the ride. At best it'll live through a proof-of-concept stage, but as soon as you start caring about performance it gets discarded.

Microsoft isn't going to pay for something that amounts to a useful setup script wrapped around an inefficient convenience library intended for people to be able to run AI on consumer hardware. There's no exploitable value proposition, whereas building their own closed source AI systems that are tightly coupled to the Windows ecosystem and favor cloud services allows them to extract maximum rent.


Their behaviour around acknowledging llama.cpp is very shady. Until very recently there was no mention of llama.cpp in their README at all, and now it's tucked away at the very bottom. Compare that to the originally proposed PR, for example: https://github.com/ollama/ollama/pull/3700


Do you happen to know which alternative engines they're talking about? Or is it just a way to evade the fact that, at the end of the day, it is just a wrapper around llama.cpp?


It was mentioned in another reply to the parent. There are no alternatives currently; the whole thing has been built upon llama.cpp since its inception.


Ollama is great. I actually wish they would wrap OpenAI and Azure and generally act as a proxy for third-party APIs. Having a consistent, well-thought-out API which isn't tied to a single provider would be really good for the community.

Edit: this would be useful because in many cases some workloads can be local but others cannot, e.g. if you really need GPT-4 for specific queries.
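
For the other direction, Ollama already exposes an OpenAI-compatible endpoint, so existing OpenAI client code can at least be pointed at the local server and swapped between providers by changing the base URL. A rough sketch (assuming llama3 has been pulled and the server is on the default port):

  from openai import OpenAI

  # The OpenAI client requires an api_key, but Ollama ignores its value.
  client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

  resp = client.chat.completions.create(
      model="llama3",
      messages=[{"role": "user", "content": "Say hello in one sentence."}],
  )
  print(resp.choices[0].message.content)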


It is open source, so if you want to see this in ollama, pull requests are welcome. :)


Ollama is simply great! I was quite surprised how easy it is to integrate through their API. A simple chat using Ollama + llama3 is less than 40 lines of TypeScript: https://github.com/wiktor-k/llama-chat


Nice! Would there be a way to do that streaming, with streaming voice input too?


Streaming is not a problem (it's just a simple flag: https://github.com/wiktor-k/llama-chat/blob/main/index.ts#L2...) but I've never used voice input.

The examples show image input though: https://github.com/ollama/ollama/blob/main/docs/api.md#reque...

Maybe you can file an issue here: https://github.com/ollama/ollama/issues
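
For reference, the streaming flag just switches the response to newline-delimited JSON chunks that you read until "done" is true. A minimal sketch, shown here in Python rather than TypeScript (assumes a local server with llama3 pulled):

  import json
  import requests

  # Stream a chat response from a local Ollama server chunk by chunk.
  with requests.post(
      "http://localhost:11434/api/chat",
      json={
          "model": "llama3",
          "messages": [{"role": "user", "content": "Tell me a short joke."}],
          "stream": True,
      },
      stream=True,
  ) as resp:
      for line in resp.iter_lines():
          if not line:
              continue
          chunk = json.loads(line)
          if chunk.get("done"):
              break
          print(chunk["message"]["content"], end="", flush=True)
  print()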


I actually just benchmarked Llama3 70B coding with aider, and it did quite well. It scored similarly to GPT-3.5.

You can use Llama3 70B with aider via Ollama [0]. It's also available for free via Groq [1] (with rate limits). And OpenRouter has it available [2] for low cost on their paid API.

[0] https://aider.chat/docs/llms.html#ollama

[1] https://aider.chat/docs/llms.html#groq

[2] https://aider.chat/docs/llms.html#openrouter


Paul's benchmarks are excellent, and they're the first thing I look for to get a sense of a new model's performance :)

For those looking to create their own benchmarks, promptfoo[0] is one way to do this locally:

  prompts:
    - "Write this in Python 3: {{ask}}"
  
  providers:
    - ollama:chat:llama3:8b
    - ollama:chat:phi3
    - ollama:chat:qwen:7b
    
  tests:
    - vars:
        ask: a function to determine if a number is prime
    - vars:
        ask: a function to split a restaurant bill given individual contributions and shared items

Jumping in because I'm a big believer in (1) local LLMs, and (2) evals specific to individual use cases.

[0] https://github.com/typpo/promptfoo



I love working with Ollama; I was really surprised at how easy it is to build a simple RAG system with it. For example: https://github.com/stephen37/ollama_local_rag
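
The core retrieve-then-generate loop is small even without a framework. A minimal sketch against Ollama's own endpoints (not the linked repo's approach; it assumes an embedding model such as nomic-embed-text has been pulled alongside llama3, and the documents are just toy examples):

  import numpy as np
  import requests

  OLLAMA = "http://localhost:11434"

  def embed(text: str) -> np.ndarray:
      # Ollama's embeddings endpoint; requires an embedding model to be pulled.
      r = requests.post(f"{OLLAMA}/api/embeddings",
                        json={"model": "nomic-embed-text", "prompt": text})
      return np.array(r.json()["embedding"])

  docs = [
      "Ollama serves local models over an HTTP API on port 11434.",
      "llama.cpp provides the quantized inference backend used by Ollama.",
      "Phi 3 is a family of small models released by Microsoft.",
  ]
  doc_vecs = np.stack([embed(d) for d in docs])

  question = "What backend does Ollama use?"
  q = embed(question)

  # Cosine-similarity retrieval: pick the most relevant document as context.
  scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
  context = docs[int(scores.argmax())]

  r = requests.post(f"{OLLAMA}/api/chat", json={
      "model": "llama3",
      "stream": False,
      "messages": [{
          "role": "user",
          "content": f"Answer using this context:\n{context}\n\nQuestion: {question}",
      }],
  })
  print(r.json()["message"]["content"])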


Nice, I've been looking out for something like this! What's Jina AI, and how is it local if I need an API key for it? Also, this is the first time I'm hearing about Poetry. Might be worth including it in the prerequisites (unless I can just stick with pip?)


> What's Jina AI and how is it local if I need an API key for it?

Looks like it's a vector DB used for creating and looking up embeddings (vectors). The LLM is the second part of RAG; the first part is having a good embedding model.


Oh yes, good point. It's because at first I was using a local embedding model and then decided to use Jina AI, as they have good embeddings.

Jina AI does a few different things, but one of them is embeddings, and I use their English/German embeddings since in one demo I'm working with German data.

You can use pip as well, but yes, let me add something about Poetry in case people don't know about it :)


[Why] do models require a new version? It can already take arbitrary GGUF files; I assumed they just had a registry online.


They do, and I was using the "new" models before the update. Perhaps there are tuning or bug fixes for them? Or they just want to confirm that these are supported. Some new models do have different architectures, so sometimes an update is necessary.


Phi 3 has a unique architecture that needed some additions to llama.cpp's conversion script. Also, Phi 3 is an absolute mess: there's no reliable way to latch on to when it's done writing a message, and no one wants to admit it; people are patching around it instead.

ex. I could condition on "\n\n<|assistant|>||<|system|>||<|user>", but it'd still be wrong.

Pretty much everything Phi 3 feels like it was rushed out within 48 hours, a month too early. The ONNX genai library doesn't work on Mac at all, the mobile SDKs don't support it... sigh


Because the way they are quantized takes time to get bug-free when new architectures are released. If a model was quantized with a known bug in the quantizer, those quantized versions are effectively buggy and need to be requantized with a new version of llama.cpp that has the fix.


Is there a Copilot-like autocomplete VS Code plugin that uses Ollama?


Yes, the continue.dev plugin can use Ollama as a backend.


Download the DeepSeek Coder model via Ollama and connect to it using the CodeGPT plugin?


This looks very awesome. Can someone tell me why there is no chatter about this? Is there something else out there that blows this out of the water in terms of ease of use and access to sample many LLMs?


HN isn't really the best space for LLM news; r/LocalLlama and Twitter are much better. I think HN has some cultural issues with “AI” news.


Hmm, I don't think so. Most comments are pretty positive.

I think the articles just aren't really upvoted unless it's really big news, which makes sense because HN is about more than just AI.

But I don't think it's anti-AI in the way most people here are anti-cryptocurrency (and for good reason, IMO).


I didn’t upvote it because I don’t use Ollama. To experiment with LLMs I use Huggingface. Does Ollama provide something I cannot get with Huggingface?


Ollama provides a web server with an API that just works out of the box, which is great when you want to integrate multiple applications (potentially distributed on smaller edge devices) with LLMs that run on a single beefy machine.

In my home I have a large gaming rig that sometimes runs Ollama + Open WebUI, and I also have a bunch of other services running on a smaller server and a Raspberry Pi, which reach out to Ollama for their LLM inference needs.
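
From one of the smaller devices, the call to the big machine is just an HTTP request. A rough sketch (the hostname is a placeholder, and the server needs something like OLLAMA_HOST=0.0.0.0 set to accept LAN connections):

  import requests

  # Ask the big machine's Ollama instance to do the heavy lifting.
  resp = requests.post(
      "http://gaming-rig.local:11434/api/generate",
      json={"model": "llama3", "prompt": "Summarize this sensor log: ...", "stream": False},
  )
  print(resp.json()["response"])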


Sure, maybe it’s better for niche use cases like yours.

HF is the biggest provider of LLMs, and I guess I haven't run into its limitations yet.


Running locally is sometimes necessary, e.g. when you don't want to send sensitive data to a random third-party server.


Both Ollama and Hugging Face distribute models. The latter also has model hosting services, but that isn't the only way to use models from there.


Hugging Face is a model repository.

Ollama allows you to run those models.

Different things.


I run models using HF just fine. I mean I’m using HF transformers repo, which gets models from HF hub.

Or do you mean commercial deployment of models for inference?


Are you talking about the Hugging Face Python libraries, the Hugging Face hosted inference APIs, the Hugging Face web interfaces, the Hugging Face iPhone app, Hugging Face Spaces (hosted Docker environments with GPU access) or something else?


I updated my comment above: I’m using HF transformers repo, which gets models from HF hub.
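
Roughly this kind of thing, if it helps anyone picture it. A minimal sketch; the model name is just an example (and that particular one is gated, so it needs an accepted license and an HF token):

  from transformers import pipeline

  # Downloads the model from the Hugging Face hub on first use.
  pipe = pipeline(
      "text-generation",
      model="meta-llama/Meta-Llama-3-8B-Instruct",
      device_map="auto",  # uses the GPU if present, otherwise CPU
  )
  out = pipe("Write a haiku about local inference.", max_new_tokens=64)
  print(out[0]["generated_text"])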


Do you have an NVIDIA GPU? I have not had much luck with the transformers library on a Mac.


Of course. I thought Nvidia GPUs were pretty much a must-have to play with DL models.


Well, being able to run these models on CPU was pretty much the revolutionary part of llama.cpp.


I can run them on CPU; HF uses plain PyTorch code, which is fully supported on CPU.


But it's likely to be much slower than what you'd get with a backend like llama.cpp on CPU (particularly if you're running on a Mac, but I think on Linux as well), as well as not supporting features like CPU offloading.


Are there benchmarks? A 2x speed-up would not be enough for me to return to C++ hell, but 5x might be, in some circumstances.


I think the biggest selling point of Ollama (llama.cpp) is quantization: for a slight hit in quality (with q8 or q4) you can get a significant performance boost.


Does Ollama/llama.cpp provide low-bit operations (AVX or CUDA kernels) to speed up inference? Or just model compression, with inference still done in fp16?

My understanding is that modern quantization algorithms are typically implemented in PyTorch.


Sorry I don't know much about this topic.

The only thing I know (from using it) is that with quantization I can fit models like Llama 2 13B in my 24GB of VRAM when I use q8 (~16GB) instead of fp16 (~26GB). This means I get nearly the full quality of Llama 2 13B's output while still being able to use only my GPU, without needing to do very slow inference on CPU+RAM alone.

And the models are quantized before inference, so I'd only download ~16GB for Llama 2 13B q8 instead of the full 26GB, which means it's not done on the fly.
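
A rough back-of-envelope for why that works out, counting weights only (KV cache and runtime overhead add a few more GB, so treat the numbers as approximate):

  # Approximate weight memory for a 13B-parameter model at different precisions.
  # q8_0 and q4_0 are llama.cpp block formats: 8 or 4 bits per weight plus a
  # small per-block scale, hence slightly more than 1.0 / 0.5 bytes per param.
  params = 13e9
  for name, bytes_per_param in [("fp16", 2.0), ("q8_0", 1.0625), ("q4_0", 0.5625)]:
      print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB")
  # fp16: ~26 GB, q8_0: ~14 GB, q4_0: ~7 GB -- which is why a 13B q8 model
  # fits in 24 GB of VRAM with room for context, while fp16 does not.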


As an aside, even GPT-4-level quality does not feel satisfactory to me lately. I can’t imagine willingly using models as dumb as Llama 2 13B. What do you do with it?


Yeah, I agree. Every time a new model releases I download the highest quantization (or fp16) that fits into my VRAM, test it out with a few prompts, and then realize that downloadable models are still not as good as the closed ones (except speed-wise).

I don't know why I still do it, but every time I read so many comments about how good model X is and how it outperforms everything else, I want to see it for myself.


There's a Python binding for llama.cpp which is actively maintained and has worked well for me: https://github.com/abetlen/llama-cpp-python
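
A minimal usage sketch with that binding (the model path is an example; n_gpu_layers=-1 offloads everything to the GPU when one is available, 0 keeps it on the CPU):

  from llama_cpp import Llama

  # Load a local GGUF file and run an OpenAI-style chat completion.
  llm = Llama(
      model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",
      n_ctx=4096,
      n_gpu_layers=-1,
  )
  out = llm.create_chat_completion(
      messages=[{"role": "user", "content": "Name three uses for a local LLM."}],
      max_tokens=128,
  )
  print(out["choices"][0]["message"]["content"])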


Ollama supports many Radeons now. And I guess llama.cpp does too; after all, it's what Ollama uses as a backend.


PyTorch (the underlying framework of HF) supports AMD as well, though I haven’t tried it.


Ollama is really organized; it relies on llama.cpp, but the UX and organization it provides make it legit. We recently made a one-click wizard to run Open WebUI and Ollama together, self-hosted, locally run, and remotely accessible [1].

[1] https://github.com/ipv6rslimited/cloudseeder


LM Studio is a lot more user-friendly, probably the easiest UI out there. No terminal nonsense, no manual to read; just double-click and chat. It even explains what the model names mean (e.g. the difference between Q4_1, Q4_K, Q4_K_M... for whatever reason, all the other tools assume you know what that means).

Built-in model recommendations are also handy.

Very friendly tool!

However it's not open-source.


Why do you think there is no chatter about this? There have been hundreds of posts about ollama on HN. This is a point release of an already well known project.


I use a mix of llama.cpp directly via my own Python bindings and via llama-cpp-python for function calling and full control over parameters and loading, but otherwise Ollama is just great for ease of use. There's really no reason not to use it if you just want to load GGUF models and don't have any intricate requirements.


I can recommend LM Studio and Msty if you're looking for something with an integrated UX.


Opposite reaction here. I was just thinking, man I hear about Ollama every single day on HN. Not sure a point release is news :-)


Ollama has been brought up many times on HN. It's a great tool!



