OmniHai 1.2 is out. After 1.1 gave the library ears and the ability to transcribe audio, 1.2 completes the picture by giving it a voice. Text-to-Speech (TTS) joins the API alongside some improvements around file handling and conversation history to make life easier.
Text-to-Speech
Audio generation is now a first-class citizen alongside audio transcription.
The simplest form is a single generateAudio() method call that returns raw audio bytes:
byte[] audio = service.generateAudio("Welcome to OmniHai 1.2!");
If you want to stream the result directly to a file without buffering it in memory, pass a Path instead:
service.generateAudio("Welcome to OmniHai 1.2!", Path.of("/path/to/welcome.mp3"));
An asynchronous variant, generateAudioAsync(), is of course also available.
So far, only OpenAI and Google AI are supported. The other providers are missing for one of three reasons: they do not yet offer any HTTP-based TTS endpoint (Anthropic, Mistral, Meta AI, Azure, OpenRouter, Ollama), they do not offer a unified API compatible with all their TTS models (Hugging Face), or they only expose TTS through WebSocket streaming (xAI), which does not fit the request/response model OmniHai is built on. Support will be added as providers introduce compatible endpoints.
Customization is available through the new GenerateAudioOptions builder, which exposes voice, speed, and output format. Here's an example for OpenAI GPT:
var options = GenerateAudioOptions.newBuilder()
        .voice("nova")
        .speed(1.25)
        .outputFormat("opus")
        .build();
gpt.generateAudio("Welcome to OmniHai 1.2!", Path.of("/path/to/welcome.opus"), options);
One thing worth mentioning about Google AI: Gemini returns raw PCM audio rather than a proper audio container. OmniHai transparently adds a WAV header before handing back the result, so callers get consistent, playable audio regardless of which provider is used.
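OmniHai does this wrapping internally, but for the curious: converting raw PCM into a playable WAV file amounts to prefixing the samples with a standard 44-byte RIFF header. The sketch below is illustrative, not OmniHai's actual code; the method name and parameters are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class WavWrapper {

    // Prefixes raw PCM samples with a standard 44-byte RIFF/WAVE header,
    // turning headerless audio into a playable .wav byte stream.
    public static byte[] wrapPcmInWav(byte[] pcm, int sampleRate, int channels, int bitsPerSample) {
        int byteRate = sampleRate * channels * bitsPerSample / 8;
        int blockAlign = channels * bitsPerSample / 8;
        ByteBuffer header = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
        header.put("RIFF".getBytes());
        header.putInt(36 + pcm.length);   // remaining chunk size after "RIFF" + size field
        header.put("WAVE".getBytes());
        header.put("fmt ".getBytes());
        header.putInt(16);                // fmt chunk size for plain PCM
        header.putShort((short) 1);       // audio format: 1 = uncompressed PCM
        header.putShort((short) channels);
        header.putInt(sampleRate);
        header.putInt(byteRate);
        header.putShort((short) blockAlign);
        header.putShort((short) bitsPerSample);
        header.put("data".getBytes());
        header.putInt(pcm.length);
        byte[] out = new byte[44 + pcm.length];
        System.arraycopy(header.array(), 0, out, 0, 44);
        System.arraycopy(pcm, 0, out, 44, pcm.length);
        return out;
    }
}
```

The header is tiny and deterministic, which is why the library can add it transparently without re-encoding anything.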
Transcribing from a Path
The existing transcription API only accepted byte arrays, which meant loading the entire audio file into memory before making the API call.
In 1.2, transcribe() and transcribeAsync() now also accept a Path:
String transcript = service.transcribe(Path.of("/recordings/meeting.mp3"));
For providers that support a files API, the file is streamed directly from disk: no intermediate copy, no heap pressure. The byte array overload from 1.1 still works exactly as before.
Path-backed File Attachments
The same improvement extends to chat attachments.
ChatInput.Builder#attach() now accepts Path arguments alongside the existing byte array support:
var input = ChatInput.newBuilder()
        .message("Summarize this contract.")
        .attach(Path.of("/path/to/contract.pdf"))
        .build();
var summary = service.chat(input);
MIME type detection still reads only the file's magic bytes rather than inferring the type from the file extension, and the upload itself streams the content from disk. For large PDFs or images this is a meaningful difference, and it requires no change to calling code beyond passing a path instead of a byte array.
History Initialisation
Conversation memory has always lived in ChatOptions, but there was no way to seed it with a prior exchange.
In 1.2 the ChatOptions.Builder gains a history() method that accepts an existing message list:
// At the end of a session, persist the history
List<Message> saved = options.getHistory();
// On the next session, restore it
var restored = ChatOptions.newBuilder()
        .systemPrompt("You are a helpful assistant.")
        .withMemory()
        .history(saved)
        .build();
var response = service.chat("Where were we?", restored);
This makes it straightforward to persist a conversation to a database, load it back on the next session, and hand it straight to the service without any manual message reconstruction.
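How you persist the message list between sessions is up to you. As one minimal sketch of such a round-trip, using a simplified stand-in record rather than OmniHai's actual Message class, the history could be flattened to one line per message for storage in a plain TEXT column:

```java
import java.util.List;
import java.util.stream.Collectors;

public class HistoryStore {

    // Simplified stand-in for a chat message; OmniHai's own Message type will differ.
    public record Message(String role, String content) {}

    // Flattens the history to one tab-separated line per message.
    // Real code should escape tabs/newlines inside the content first.
    public static String serialize(List<Message> history) {
        return history.stream()
                .map(m -> m.role() + "\t" + m.content())
                .collect(Collectors.joining("\n"));
    }

    // Restores the message list from the persisted form.
    public static List<Message> deserialize(String saved) {
        return saved.lines()
                .map(line -> line.split("\t", 2))
                .map(parts -> new Message(parts[0], parts[1]))
                .collect(Collectors.toList());
    }
}
```

Any serialization scheme works, of course; the only contract the library imposes is that history() receives the same message list you read back out.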
Getting 1.2
Add the following dependency to your project and you are ready to go:
<dependency>
    <groupId>org.omnifaces</groupId>
    <artifactId>omnihai</artifactId>
    <version>1.2</version>
</dependency>
Feedback and contributions are welcome on the GitHub repository.