Which AI Model Should You Use?
Updated over 3 weeks ago


Voxjar lets you select between several AI Models while creating or editing your scorecards.

We currently support open-source models by Meta and Deepseek, as well as proprietary LLMs by OpenAI, Anthropic, and Google.

Our open-source models are hosted in the USA. We do not share any data with the creators of the open-source models or the providers of the proprietary models. We also do not use, or authorize the use of, any customer data to train AI models in any form or fashion.

The best AI model is often open to interpretation, especially when comparing models of the same caliber. This page gives you information on each of our models so you can make a better decision.

You can also test each model while building your scorecard to see how they each perform on your requirements.

Generally speaking though, larger and newer models perform better.

(You can also nerd out and keep track of the LLM Leaderboards)

  • Good: GPT-4o mini, Llama 3.1 70B, Claude Haiku 3, Gemini Flash 1.5

  • Better: Llama 3.1 405B, Gemini Pro 1.5, Claude Sonnet 3.5, o3 mini

  • Best: Deepseek V3, Gemini Flash 2.0, GPT-4o, Deepseek R1

If there is another model that you would like us to support, let us know!



We support two open-source Llama 3.1 models by Meta.

  • The 70B model is a smaller model that provides great value for its intelligence. It excels at straightforward questions that require up to a moderate amount of nuance.

  • The 405B model is the full-power Llama 3.1. You can expect excellent performance across the board, including a high level of nuance in understanding your questions and requirements.


We support two open-source Deepseek models.

  • V3 offers huge value. It is a next-generation LLM (2025) with performance on par with GPT-4o, for half the credits. V3 provides nuanced responses and can handle more questions with more challenging context.

  • R1 is Deepseek's reasoning AI and is currently the most capable model in our library. R1 uses extra layers of thinking to produce the best possible response. Reasoning models are much more compute intensive and are best reserved for complex or your most important evaluations.

Because Deepseek was trained in China, you might notice cultural differences in its responses. We have not yet run across any that impact its ability to score calls.

Try it out for yourself and let us know!

Some might see Deepseek's origin as a risk since the company is based in China. Voxjar does not interact with that company in any way; we use the open-source models hosted in the USA.


  • GPT-4o mini is a low-cost, low-latency alternative to GPT-4o. The tradeoff is that it is not as capable as GPT-4o. Mini is a great option if your calls are short and your scorecards are simple.

  • GPT-4o is arguably the most capable large language model on the market. It is provided by OpenAI and has consistently set the standard. GPT-4o has a high level of nuance in understanding your scorecard questions. This model also has very high throughput.

  • o3 mini is OpenAI's most recent Reasoning model. Reasoning models use extra layers of thinking to deliver superior responses. o3 mini is faster and much cheaper than the o1 series while performing on par with the larger o1 model. It is one of the smartest models in the Voxjar library and offers a fantastic value for that intelligence.

You also automatically benefit from any updates that OpenAI makes to these models.


Anthropic is the creator and provider of Claude.ai.

  • Claude Haiku 3 is Anthropic's smaller and faster LLM. Haiku performs best on straightforward scorecards and shorter calls.

  • Claude Sonnet 3.5 is Anthropic's flagship LLM and is considered by many to be superior to GPT-4o. Sonnet handles complicated context well.

These models have lower throughput and can sometimes get bogged down. If your evaluations are time sensitive we recommend using a different model.


Google provides the Gemini Large Language Models. These models are famous for having the largest context window in the industry.

  • Gemini Flash 1.5 is Google's smaller, faster LLM. Great for simpler scorecards with less demanding contextual awareness. This model will be replaced by v2 when Flash 2.0 is generally available.

  • Gemini Pro 1.5 is Google's current flagship model. You can expect excellent performance. It excels at handling long phone calls and more complex scorecards.

  • Gemini Flash 2.0 is Google's first version 2 LLM to be made publicly available. Flash 2.0 is still an experimental model, but it already outperforms Pro 1.5 in many areas and is twice as fast. However, Pro 1.5 is still better at handling longer phone calls and scorecards.

Google will release other 2.0 models in the near future, including a reasoning model. We plan to replace the 1.5 models when v2 is considered stable and generally available.

