Skip to content

NexaAI/nexa-sdk

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Nexa SDK

Nexa SDK is an on-device inference framework that runs any model on any device, across any backend. It runs on CPUs, GPUs, NPUs with backend support for CUDA, Metal, Vulkan, and Qualcomm NPU. It handles multiple input modalities including text 📝, image 🖼️, and audio 🎧. The SDK includes an OpenAI-compatible API server with support for JSON schema-based function calling and streaming. It supports model formats such as GGUF, MLX, Nexa AI's own .nexa format, enabling efficient quantized inference across diverse platforms.

Qualcomm NPU PC Demos

Multi-Image Reasoning Demo

🖼️ Multi-Image Reasoning
Spot the difference across two images in multi-round dialogue.

Image + Audio Function Call Demo

🎤 Image + Text → Function Call
Snap a poster, add a voice note, and AI agent creates a calendar event.

Multi-Audio Comparison Demo

🎶 Multi-Audio Comparison
Tell the difference between two music clips locally.

Recent updates

📣 2025.08.20: Qualcomm NPU Support

📣 **2025.08.12: ASR & TTS Support in MLX format

  • ASR & TTS model support in MLX format.
  • new "> /mic" mode to transcribe live speech directly in your terminal.

Installation

macOS

Windows

Linux

curl -fsSL https://raw.githubusercontent.com/NexaAI/nexa-sdk/main/release/linux/install.sh -o install.sh && chmod +x install.sh && ./install.sh

Supported Models

You can run any compatible GGUF,MLX, or nexa model from 🤗 Hugging Face by using the <full repo name>.

Qualcomm NPU models

Tip

You need to download the arm64 with Qualcomm NPU support and make sure you have Snapdragon® X Elite chip on your laptop.

🖼️ Run and chat with our multimodal model, OmniNeural-4B:

nexa infer omni-neural
nexa infer NexaAI/OmniNeural-4B

GGUF models

Tip

GGUF runs on macOS, Linux, and Windows.

📝 Run and chat with LLMs, e.g. Qwen3:

nexa infer ggml-org/Qwen3-1.7B-GGUF

🖼️ Run and chat with Multimodal models, e.g. Qwen2.5-Omni:

nexa infer NexaAI/Qwen2.5-Omni-3B-GGUF

MLX models

Tip

MLX is macOS-only (Apple Silicon). Many MLX models in the Hugging Face mlx-community organization have quality issues and may not run reliably. We recommend starting with models from our curated NexaAI Collection for best results. For example

📝 Run and chat with LLMs, e.g. Qwen3:

nexa infer NexaAI/Qwen3-4B-4bit-MLX

🖼️ Run and chat with Multimodal models, e.g. Gemma3n:

nexa infer NexaAI/gemma-3n-E4B-it-4bit-MLX

CLI Reference

Essential Command What it does
nexa -h show all CLI commands
nexa pull <repo> Interactive download & cache of a model
nexa infer <repo> Local inference
nexa list Show all cached models with sizes
nexa remove <repo> / nexa clean Delete one / all cached models
nexa serve --host 127.0.0.1:8080 Launch OpenAI‑compatible REST server
nexa run <repo> Chat with a model via an existing server

👉 To interact with multimodal models, you can drag photos or audio clips directly into the CLI — you can even drop multiple images at once!

See CLI Reference for full commands.

Acknowledgements

We would like to thank the following projects: