Which AI Model Should You Use?
Voxjar lets you choose from several AI models while creating or editing your scorecards.
We currently support open source models from Meta, DeepSeek, Moonshot AI, and Alibaba, as well as proprietary LLMs from OpenAI, Anthropic, and Google.
Our open source models are hosted in the USA. We do not share any data with the creators of the open source models. We do not use, or authorize the use of, any customer data to train AI models in any form or fashion.
The best AI model is often a matter of interpretation, especially when comparing models of the same caliber. This page gives you information on each of our models so you can make a better-informed decision.
You can also test each model while building your scorecard to see how each one performs against your requirements.
Generally speaking, though, larger and newer models perform better.
(You can also nerd out and keep track of the LLM Leaderboards.)
If there is another model that you would like us to support, let us know.
We support two open source Llama 3.1 models by Meta.
The 70B model is the smaller of the two and offers a great value-to-intelligence ratio. It excels at straightforward questions that require up to a moderate amount of nuance.
The 405B model is the full-power Llama 3.1. You can expect excellent performance across the board, including a high level of nuance in understanding your questions and requirements.
We support two open source DeepSeek models.
V3 offers huge value. It is a next-generation LLM (2025) with performance on par with GPT-4o for half the credits. V3 provides nuanced responses and can handle more questions and more challenging context.
R1 is DeepSeek's reasoning model and is currently the most capable AI in our library. R1 uses extra layers of thinking to produce the best possible response. Reasoning models are much more compute intensive and are best reserved for your most complex or important evaluations.
Because DeepSeek was trained in China, you might notice cultural differences in its responses. We have not run across any that impact its ability to score calls yet.
Try it out for yourself and let us know!
Some might see DeepSeek's origin as a risk since the company is based in China. Voxjar does not interact with that company in any way; we use the open source models hosted in the USA.
GPT-4o mini is a low-cost, low-latency alternative to GPT-4o. The tradeoff is that it is not as capable as GPT-4o. Mini is a great option if your calls are short and your scorecards are simple.
GPT-4o is one of the most capable large language models on the market. It is provided by OpenAI and has consistently set the standard. GPT-4o has a high level of nuance in understanding your scorecard questions. This model also has very high throughput.
GPT-5 mini is a faster, cheaper version of GPT-5 for well-defined tasks. Mini is a great option if your calls are short and your scorecards are straightforward.
GPT-5 is the next generation of OpenAI's flagship model after GPT-4o. There have been mixed reports on general performance, but it is still considered a cutting-edge model more capable than most on the market. GPT-5 is a thinking model by default, so it can use chains of logic to solve problems.
o3 mini is a reasoning model. Reasoning models use extra layers of thinking to deliver superior responses. o3 mini is faster and much cheaper than the o1 series while performing on par with the larger o1 model.
o4 mini is OpenAI's latest reasoning model. o4 mini is faster and much cheaper than larger reasoning models while performing on par with them. It offers fantastic value for top-tier intelligence.
You also automatically benefit from any updates that OpenAI makes to these models.
Anthropic is the creator and provider of Claude.ai.
Claude Haiku 3.5 is Anthropic's smaller and faster LLM. Haiku performs best on straightforward scorecards and shorter calls.
Claude Sonnet 3.7-4.5 is Anthropic's flagship LLM. Many users prefer it and consider it to have a more human personality. Sonnet is extremely capable.
These models have lower throughput and can sometimes get bogged down. If your evaluations are time-sensitive, we recommend using a different model.
Google provides the Gemini Large Language Models. These models are famous for having the largest context window in the industry.
Gemini Flash 2.0 is considered by many to have the best value-to-intelligence ratio of any LLM. It is a very capable model that provides great bang for your buck.
Gemini Flash 2.5 (Thinking & Non-Thinking) is the newest version of Flash, with the option to enable thinking mode. Both versions are very capable, and the Thinking version can reason through a chain of thought to better evaluate your interactions. 2.5 can be slower than 2.0 but tends to outperform it.
Gemini Pro 2.5 (Thinking & Non-Thinking) is Google's flagship model and is typically considered to be the best model available today. Gemini Pro excels at evaluating long interactions with complex scorecards that require a level of nuance other models struggle to provide.
Moonshot AI provides an open source model that we host in the USA. Moonshot's LLM, Kimi, is considered one of the best LLMs in the world. It also happens to be open source.
Kimi K2 is an open source model that performs on par with or better than closed source LLMs. Kimi was designed for agentic work and excels as an AI call evaluator.
Alibaba develops and provides Qwen as an open source LLM. These models are famous for competing head-to-head with closed source models from Anthropic, OpenAI, and Google. Like all of our open source models, Voxjar hosts them in the USA and does not send any data to Alibaba.
Qwen3 A22B (Thinking & Non-Thinking) is the newest version of Qwen, with the option to enable thinking mode. Both versions are extremely capable. The Thinking version can create a chain of thought to better reason about and evaluate your interactions.