Monday, April 20, 2026

OmniHai counts the cost

OmniHai 1.4 is out! After 1.1 gave the library ears, 1.2 a voice, and 1.3 the ability to step outside and browse the web, 1.4 teaches it to count. Token usage becomes actual money in your preferred currency, runaway spend can be capped, reasoning effort is now dial-able across providers, and ChatOptions knows how to serialize itself to portable JSON.

<dependency>
    <groupId>org.omnifaces</groupId>
    <artifactId>omnihai</artifactId>
    <version>1.4</version>
</dependency>

Cost Calculation

1.3 introduced ChatUsage so you could see how many tokens a call consumed. Useful, but tokens are not what the invoice at the end of the month is denominated in. 1.4 closes that gap with ChatPricing and ChatCost.

Attach a ChatPricing to your ChatOptions, make a call, read back the cost:

ChatPricing pricing = new ChatPricing(
    new BigDecimal("3.00"),  // input price per 1M tokens
    new BigDecimal("0.30"),  // cached-input price per 1M tokens (optional)
    new BigDecimal("15.00"), // output price per 1M tokens (includes reasoning)
    Currency.getInstance("USD"));

ChatOptions options = ChatOptions.newBuilder()
    .pricing(pricing)
    .build();

String response = service.chat("Explain quantum computing", options);

ChatCost cost = options.getLastCost();
System.out.println("Input cost:        " + cost.inputCost());
System.out.println("Cached input cost: " + cost.cachedInputCost());
System.out.println("Output cost:       " + cost.outputCost());
System.out.println("Total cost:        " + cost.totalCost() + " " + cost.currency());

Prices are expressed per one million tokens to match how providers publish their rate sheets. There are deliberately no built-in rate presets; provider rates drift and differ per model tier, so you look up the current numbers for your chosen model and pass them in. The optional currency is passed through to ChatCost for display; it does not affect any arithmetic, so use whatever unit you supplied the prices in.

The cachedInputTokenPrice is optional. When null, cached tokens are billed at the regular input rate. Set it explicitly to reflect the provider's cache-read discount (Anthropic charges roughly 10% of the input rate for cache reads, OpenAI and Google roughly 25%). Reasoning tokens are always billed at the output rate, consistent with how providers invoice them.
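The arithmetic behind these numbers is simple enough to sketch. The following is a self-contained illustration of the per-1M-token math, not the library's internal code; in particular, billing the cached subset at the cached rate and only the remainder at the base rate is my reading of the fields described above.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class CostSketch {

    private static final BigDecimal ONE_MILLION = new BigDecimal("1000000");

    // cost = tokens * pricePerMillion / 1,000,000, at a fixed scale.
    static BigDecimal cost(long tokens, BigDecimal pricePerMillion) {
        return BigDecimal.valueOf(tokens)
            .multiply(pricePerMillion)
            .divide(ONE_MILLION, 6, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        BigDecimal inputPrice = new BigDecimal("3.00");   // per 1M input tokens
        BigDecimal cachedPrice = new BigDecimal("0.30");  // per 1M cached input tokens
        BigDecimal outputPrice = new BigDecimal("15.00"); // per 1M output tokens, reasoning included

        long inputTokens = 12_000;  // total input, of which...
        long cachedTokens = 8_000;  // ...this subset was served from cache
        long outputTokens = 2_500;  // includes reasoning tokens

        // A null cached price falls back to the regular input rate.
        BigDecimal cachedRate = cachedPrice != null ? cachedPrice : inputPrice;

        BigDecimal total = cost(inputTokens - cachedTokens, inputPrice)
            .add(cost(cachedTokens, cachedRate))
            .add(cost(outputTokens, outputPrice));

        System.out.println(total); // prints 0.051900
    }
}
```

With the example rates above, the cache discount turns 0.024 of input cost into 0.0024, which is exactly the kind of difference the ChatCost breakdown makes visible.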

If the full positional constructor feels a bit ceremonial, there are two factory methods:

ChatPricing simple = ChatPricing.of(new BigDecimal("3.00"), new BigDecimal("15.00"));
ChatPricing withCache = ChatPricing.of(new BigDecimal("3.00"), new BigDecimal("0.30"), new BigDecimal("15.00"));

And if you have a ChatUsage in hand and want the cost ad-hoc without configuring options at all:

ChatCost cost = usage.calculateCost(pricing);

One caveat worth mentioning up front: this is a simplified three-tier scheme (base input, cached input, output) that covers the common case. Provider-specific billing axes like Anthropic's 5-minute and 1-hour cache-write premiums are not modeled and may cause under-counting for workloads that rely heavily on explicit prompt caching. For strict accuracy, reconcile against the provider's own billing API. For "roughly what did that call cost me" it is good enough.

Budget Cap

Cost visibility is nice. Cost protection is nicer. 1.4 also lets you attach a cumulative-cost ceiling alongside the pricing so runaway spend on a given ChatOptions instance gets stopped rather than logged after the fact:

ChatOptions options = ChatOptions.newBuilder()
    .pricing(pricing, new BigDecimal("1.00")) // hard stop at $1.00
    .build();

while (hasMoreWork()) {
    try {
        service.chat(next(), options);
    } catch (AIBudgetExceededException e) {
        log.warn("Spent {} of {} {} — stopping", e.getTotalCost(), e.getMaxTotalCost(), e.getCurrency());
        break;
    }
}

The cap is checked before each call using the accumulated ChatOptions.getTotalCost(). It is a soft ceiling: the call that pushes the running total to or over the cap still completes and is billed; only the next call is refused with AIBudgetExceededException. That keeps the behavior predictable. The alternative, estimating an upcoming call's cost before dispatching it, would require knowing the output token count in advance, which of course you don't.
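The semantics boil down to one comparison before each call. A minimal sketch with hypothetical names, not the library's actual internals:

```java
import java.math.BigDecimal;

// Sketch of a soft ceiling: checked BEFORE each call, so the call
// that crosses the cap completes and is billed; the next is refused.
public class BudgetSketch {

    private final BigDecimal maxTotalCost;
    private BigDecimal totalCost = BigDecimal.ZERO;

    public BudgetSketch(BigDecimal maxTotalCost) {
        this.maxTotalCost = maxTotalCost;
    }

    public void beforeCall() {
        if (totalCost.compareTo(maxTotalCost) >= 0) {
            throw new IllegalStateException(
                "Budget exceeded: spent " + totalCost + " of " + maxTotalCost);
        }
    }

    public void afterCall(BigDecimal callCost) {
        totalCost = totalCost.add(callCost); // the crossing call is still billed
    }

    public void resetBudget() {
        totalCost = BigDecimal.ZERO; // fresh window on the same instance
    }
}
```

With a cap of 1.00 and calls costing 0.60 each, the first two calls go through (the running total ends at 1.20) and only the third is refused.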

After you have caught the exception, you can call options.resetBudget() to zero the counter and start a fresh window on the same instance, or switch to a different ChatOptions instance, or even fail over to a different AIService (e.g. a cheaper model) to continue processing.

Cached Input Tokens

While we are on the subject of prompt caches, ChatUsage has gained a fourth field: cachedInputTokens().

ChatUsage usage = options.getLastUsage();
System.out.println("Input tokens:         " + usage.inputTokens());
System.out.println("Cached input tokens:  " + usage.cachedInputTokens()); // subset of inputTokens
System.out.println("Output tokens:        " + usage.outputTokens());
System.out.println("Reasoning tokens:     " + usage.reasoningTokens());   // subset of outputTokens
System.out.println("Total tokens:         " + usage.totalTokens());

It reports the subset of input tokens that was served from the provider's prompt cache. This is the number that drives the cheaper cachedInputCost on ChatCost, and it is useful on its own too; a low cache-hit ratio on a workload that should mostly be reused content is a good signal that your system prompts are drifting or the provider's cache TTL has elapsed. As with the other fields, a value of -1 means the provider did not report it.
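Computing that hit ratio from the usage numbers is a one-liner, guarded for the -1 "not reported" convention. A hypothetical helper, not part of the library:

```java
public class CacheStatsSketch {

    // Fraction of input tokens served from the prompt cache,
    // or -1 when the provider did not report cached tokens.
    static double cacheHitRatio(long inputTokens, long cachedInputTokens) {
        if (inputTokens <= 0 || cachedInputTokens < 0) {
            return -1;
        }
        return (double) cachedInputTokens / inputTokens;
    }

    public static void main(String[] args) {
        System.out.println(cacheHitRatio(10_000, 8_000)); // prints 0.8
        System.out.println(cacheHitRatio(10_000, -1));    // prints -1.0
    }
}
```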

Reasoning Effort

Modern frontier models (GPT-5, Claude extended thinking, Gemini thinking, Grok reasoning) all let you tune how many tokens they should spend on internal reasoning before answering. The knobs are called different things across providers; in OmniHai they live behind a single enum:

ChatOptions options = ChatOptions.newBuilder()
    .reasoningEffort(ReasoningEffort.HIGH)
    .build();

String answer = service.chat("Prove the Pythagorean theorem.", options);

The available levels are AUTO (the default, defers to the provider's own default), NONE (actively disable reasoning where supported, for minimum cost and latency), LOW (~20% of budget), MEDIUM (~50% of budget), HIGH (~80% of budget), and XHIGH (~95% of budget). Providers that do not support a given level map to the closest equivalent, so you can leave the same ChatOptions in place while switching the underlying provider.
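The budget fractions listed above can be pictured as a simple mapping. This is a sketch only; the real ReasoningEffort enum's internals, and how each provider adapter translates the level, may well differ:

```java
// Hypothetical mirror of the documented budget fractions.
// AUTO is omitted here because it defers to the provider's own default.
public enum EffortSketch {
    NONE(0.00), LOW(0.20), MEDIUM(0.50), HIGH(0.80), XHIGH(0.95);

    private final double budgetFraction;

    EffortSketch(double budgetFraction) {
        this.budgetFraction = budgetFraction;
    }

    // E.g. how a provider adapter might derive a thinking-token budget.
    public int thinkingTokens(int maxTokens) {
        return (int) Math.round(maxTokens * budgetFraction);
    }
}
```

This also makes the maxTokens interplay mentioned below concrete: at HIGH, roughly 8,000 of a 10,000-token budget would go to reasoning before a single answer token is produced.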

Higher levels typically improve answer quality on hard problems (math, multi-step planning, non-trivial code) at the cost of more tokens and latency. On trivial prompts they just spend money without any measurable upside, so do not set HIGH or XHIGH as the default for all your calls :) Keep in mind that a higher effort may also require a correspondingly higher maxTokens to avoid truncated responses.

Portable JSON for ChatOptions

ChatOptions has been Serializable since day one, which is enough to stash it in an HTTP session. For portable storage, REST payloads, JSON columns, audit logs, or cross-service transport, Java serialization is not what you want. 1.4 adds an explicit JSON form:

String json = options.toJson();
ChatOptions restored = ChatOptions.fromJson(json);

All user-facing settings are included: system prompt, JSON schema, temperature, maxTokens, reasoning effort, topP, web search location, pricing, maxTotalCost, maxHistory, and the full conversation history (including any recorded uploaded file references). Null or unset fields are omitted for a compact payload. Runtime state (the last usage and the cumulative total cost) is deliberately not serialized; a restored instance starts with a fresh, zeroed total cost counter.
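To give a feel for the output, a payload could look something like the following. The field names here are illustrative guesses on my part, not the normative format; inspect the actual output of toJson() for the real thing.

```json
{
  "systemPrompt": "You are a concise assistant.",
  "temperature": 0.7,
  "maxTokens": 4096,
  "reasoningEffort": "HIGH",
  "pricing": {
    "inputTokenPrice": "3.00",
    "cachedInputTokenPrice": "0.30",
    "outputTokenPrice": "15.00",
    "currency": "USD"
  },
  "maxTotalCost": "1.00"
}
```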

Round-tripping a shared default constant (DEFAULT, CREATIVE, DETERMINISTIC) yields a mutable copy, equivalent to calling copy(). That way you never accidentally end up with a restored instance that still rejects mutations because it was derived from an immutable template.

Default Models

Under the hood, default model identifiers per provider have been refreshed to match the current state of technology. The exact identifiers are documented on the GitHub README. If you were relying on the provider default, you get the newer model automatically on upgrade; if you were pinning a specific model, nothing changes for you.

Getting 1.4

Non-Maven users: download the OmniHai 1.4 JAR and drop it in /WEB-INF/lib the usual way, replacing the older version if any.

Maven users:

<dependency>
    <groupId>org.omnifaces</groupId>
    <artifactId>omnihai</artifactId>
    <version>1.4</version>
</dependency>

Give It a Try

As always, feedback and contributions are welcome on GitHub: open an issue if you run into problems, or send a pull request.