AI Shubka
Three AI engines walk into a bar in single file… • The Register

by ShubkaAi
February 8, 2026
in AI & Future Tech, AI breakthroughs (GPT updates, generative models), Best AI tools for creators, Robotics & automation, Tech forecasts


Developers looking to gain a better understanding of machine learning inference on local hardware can fire up a new llama engine.

Software developer Leonardo Russo has released llama3pure, which incorporates three standalone inference engines: a pure C implementation for desktops, a pure JavaScript implementation for Node.js, and a pure JavaScript version for web browsers that does not require WebAssembly.

“All versions are compatible with the Llama and Gemma architectures,” Russo explained to The Register in an email. “The goal is to provide a dependency-free, isolated alternative in both C and JavaScript capable of reading GGUF files and processing prompts.”

GGUF stands for GPT-Generated Unified Format; it is a common format for distributing machine learning models.
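The GGUF format opens with a small fixed header, per the published GGUF specification: a four-byte magic string, a version number, a tensor count, and a metadata key/value count, all little-endian. A minimal sketch of parsing that header in Node.js (illustrative only; this is not code from llama3pure):

```javascript
// Parse the fixed GGUF header, following the published GGUF spec:
// 4-byte magic "GGUF", uint32 version, uint64 tensor count,
// uint64 metadata key/value count, all little-endian.
function readGgufHeader(buf) {
  const magic = buf.toString("ascii", 0, 4);
  if (magic !== "GGUF") throw new Error("not a GGUF file");
  return {
    version: buf.readUInt32LE(4),             // format version (3 is current)
    tensorCount: buf.readBigUInt64LE(8),      // number of tensors in the file
    metadataKvCount: buf.readBigUInt64LE(16), // number of metadata key/value pairs
  };
}

// Build a fake 24-byte header to exercise the parser.
const demo = Buffer.alloc(24);
demo.write("GGUF", 0, "ascii");
demo.writeUInt32LE(3, 4);
demo.writeBigUInt64LE(291n, 8);
demo.writeBigUInt64LE(20n, 16);
console.log(readGgufHeader(demo)); // { version: 3, tensorCount: 291n, metadataKvCount: 20n }
```

After the header come the metadata key/value pairs (model architecture, tokenizer, hyperparameters) and the tensor descriptors, which is where the bulk of a real parser's work lies.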

Llama3pure is not intended as a replacement for llama.cpp, a widely used inference engine for running local models that’s significantly faster at responding to prompts. Llama3pure is an educational tool.

“I see llama3pure as a more flexible alternative to llama.cpp specifically when it comes to architectural transparency and broad hardware compatibility,” Russo explained. “While llama.cpp is the standard for high-performance optimization, it involves a complex ecosystem of dependencies and build configurations; llama3pure takes a different approach.”

Russo believes developers can benefit from having an inference engine in a single, human-readable file that makes evident the logic of file-parsing and token generation.

“The project’s main purpose is to provide an inference engine contained within a single file of pure code,” he said. “By removing external dependencies and layers of abstraction, it allows developers to grasp the entire execution flow – from GGUF parsing to the final token – without jumping between files or libraries. It’s built for those who need to understand exactly what the hardware is doing.”

Russo also sees utility for situations where the developer is running legacy software or hardware, where client-side WebAssembly isn’t an option, and where having an isolated tool without the potential for future dependency conflicts might be desirable.

The C and Node.js engines, he said, have been tested with Llama models up to 8 billion parameters and with Gemma models up to 4 billion parameters. The main limiting factor is the physical RAM required to host model weights.

The RAM required to run machine learning models on local hardware is roughly 1GB per billion parameters when the model is quantized at 8 bits. Double or halve the precision and you double or halve the memory required. Models are commonly quantized at 16 bits, so for a 1 billion-parameter model, 2GB would typically be required.
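That rule of thumb can be written as a one-line helper. The function name and examples are this article's illustration, not part of llama3pure:

```javascript
// Rough RAM estimate for model weights alone: (bits per weight / 8)
// bytes per parameter, so 8-bit quantization needs ~1 GB per billion
// parameters. Excludes KV cache and runtime overhead.
function weightRamGB(paramsBillions, quantBits) {
  return paramsBillions * (quantBits / 8);
}

console.log(weightRamGB(1, 16)); // 1B params at 16-bit -> 2 GB
console.log(weightRamGB(8, 8));  // 8B params at 8-bit  -> 8 GB
console.log(weightRamGB(8, 4));  // 8B params at 4-bit  -> 4 GB
```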

According to Russo, the calculation for GGUF weights is different.

“GGUF weights are loaded directly into RAM, which usually means the RAM usage matches the entire file size,” he explained. “You can reduce the context window size by passing a specific parameter (context_size) – a feature supported by most inference engines, including the three I designed. While reducing the context window is a common ‘trick’ to save RAM when running models locally, it also means the AI won’t ‘remember’ as much as it was originally designed to.”
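The RAM saved by a smaller context window comes from the KV cache, whose size grows linearly with context length. A back-of-the-envelope estimate, using Llama 3 8B's publicly documented shape (32 layers, 8 KV heads under grouped-query attention, head dimension 128); the numbers are this article's assumption, not taken from llama3pure:

```javascript
// KV cache bytes = 2 (keys + values) * layers * kv_heads * head_dim
//                  * context_length * bytes per element.
function kvCacheBytes(layers, kvHeads, headDim, contextLen, dtypeBytes) {
  return 2 * layers * kvHeads * headDim * contextLen * dtypeBytes;
}

// Llama 3 8B shape with an fp16 (2-byte) cache.
const full = kvCacheBytes(32, 8, 128, 8192, 2);  // 8K context  -> 1 GiB
const small = kvCacheBytes(32, 8, 128, 2048, 2); // 2K context  -> 256 MiB
console.log(full / 2 ** 30, "GiB;", small / 2 ** 20, "MiB");
```

Quartering the context window quarters the cache, which is why the trick matters on RAM-constrained machines.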

He also said that llama3pure is presently focused on single-turn inference. He expects to implement chat history state management at a later date.
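Single-turn inference means each request starts from the prompt alone, with no conversation state carried between calls. A generic greedy-decoding loop makes the distinction concrete; `toyNextToken` is a stand-in for a real forward pass and none of this is llama3pure's API:

```javascript
// Stub "model": returns the last token + 1, signalling end-of-sequence
// with -1 once it reaches 5. A real engine would run a forward pass here.
function toyNextToken(tokens) {
  const last = tokens[tokens.length - 1];
  return last >= 5 ? -1 : last + 1;
}

// Single-turn generation: a fresh token buffer on every call, so nothing
// persists between turns. Chat-history support would instead retain and
// extend this state across calls.
function generateSingleTurn(promptTokens, maxNew) {
  const tokens = [...promptTokens];
  for (let i = 0; i < maxNew; i++) {
    const next = toyNextToken(tokens);
    if (next === -1) break; // stop on end-of-sequence
    tokens.push(next);
  }
  return tokens;
}

console.log(generateSingleTurn([1, 2], 10)); // [ 1, 2, 3, 4, 5 ]
```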

For daily work, Russo says he uses Gemma 3 as a personal assistant, powered by his C-based inference engine, to ensure that sensitive data is handled privately and offline.

“For a coding assistant, I recommend Gemma 3 27B,” he said. “Regarding the latency concerns, while local models were historically slow, running optimized versions on modern hardware now provides an experience very close to cloud-based models like Claude and without the need to pay for such a service.”

While Russo expects common general-purpose AI assistance will continue to rely on cloud-hosted models, he foresees developers and businesses increasingly looking at local AI. Developer machines with 32GB or 48GB of RAM may lack the context windows available with cloud-hosted models, but they provide security and privacy without dependence on service providers.

Asked how he feels as a developer about the AI transition, Russo said he expects developers to eventually transition to AI supervisors.

“Since AI models present answers with high confidence – even when incorrect – a human expert must remain in the loop to verify the output,” he said. “Technical knowledge will not become obsolete; rather, it will become increasingly vital for auditing AI-generated work. 

“While job titles may change, senior developers will always be necessary to maintain these systems, creating a workflow significantly faster than human-only development. For junior and mid-level developers, AI offers the opportunity to learn faster than previous generations. If managed correctly, AI can facilitate a significant leap in the industry’s intellectual evolution.” ®



© 2026 aishubka - Smarter Business. & Automated Future. by aishubka.
