
Java LLM Inference Guide

This guide covers loading and running language models from Java using SKaiNET's blocking, streaming, and async APIs: LLaMA for text generation and BERT for embeddings.

Prerequisites

  • JDK 21+ with --enable-preview --add-modules jdk.incubator.vector
  • See Java Getting Started for project setup

Maven Dependencies

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>sk.ainet</groupId>
            <artifactId>skainet-bom</artifactId>
            <version>0.13.0</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <!-- LLaMA inference -->
    <dependency>
        <groupId>sk.ainet</groupId>
        <artifactId>skainet-kllama-jvm</artifactId>
    </dependency>

    <!-- Agent / tool-calling (optional) -->
    <dependency>
        <groupId>sk.ainet</groupId>
        <artifactId>skainet-kllama-agent-jvm</artifactId>
    </dependency>

    <!-- BERT embeddings (optional) -->
    <dependency>
        <groupId>sk.ainet</groupId>
        <artifactId>skainet-bert-jvm</artifactId>
    </dependency>

    <!-- CPU backend -->
    <dependency>
        <groupId>sk.ainet</groupId>
        <artifactId>skainet-backend-cpu-jvm</artifactId>
    </dependency>
</dependencies>

LLaMA Inference

All LLaMA Java classes live in sk.ainet.apps.kllama.java.

Loading a GGUF Model

The simplest way to get started is to load a GGUF file. KLlamaJava.loadGGUF() handles context creation, weight loading, quantization dispatch, and tokenizer setup behind the scenes.

import sk.ainet.apps.kllama.java.KLlamaJava;
import sk.ainet.apps.kllama.java.KLlamaSession;
import sk.ainet.apps.kllama.java.GenerationConfig;
import java.nio.file.Path;

public class LlamaExample {
    public static void main(String[] args) {
        try (KLlamaSession session = KLlamaJava.loadGGUF(Path.of("tinyllama-1.1b-q4.gguf"))) {
            String response = session.generate("The capital of France is");
            System.out.println(response);
        }
    }
}

KLlamaSession implements AutoCloseable, so try-with-resources properly releases the off-heap memory arenas when you are done.

Loading SafeTensors (HuggingFace Format)

If you have a HuggingFace model directory containing model.safetensors, config.json, and tokenizer.json:

try (KLlamaSession session = KLlamaJava.loadSafeTensors(Path.of("./my-llama-model/"))) {
    String response = session.generate("Once upon a time");
    System.out.println(response);
}

The directory must contain:

  • model.safetensors -- the model weights
  • config.json -- model architecture config (hidden size, layers, heads, etc.)
  • tokenizer.json -- HuggingFace tokenizer definition
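Since a missing file only surfaces as an error at load time, it can be convenient to validate the directory up front. A minimal sketch (this helper is illustrative, not part of the SKaiNET API):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SafeTensorsDirCheck {
    // Files a HuggingFace LLaMA export must contain, per the list above
    private static final List<String> REQUIRED =
            List.of("model.safetensors", "config.json", "tokenizer.json");

    /** Returns the names of required files missing from the directory. */
    public static List<String> missingFiles(Path dir) {
        return REQUIRED.stream()
                .filter(name -> !Files.isRegularFile(dir.resolve(name)))
                .toList();
    }
}
```

Call `missingFiles(...)` before `loadSafeTensors(...)` and report anything it returns, rather than letting the loader fail mid-initialization.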

GenerationConfig

Control generation parameters with the builder pattern:

GenerationConfig config = GenerationConfig.builder()
        .maxTokens(256)       // maximum tokens to generate (default: 256)
        .temperature(0.7f)    // sampling temperature (default: 0.8)
        .build();

String response = session.generate("Explain quantum computing", config);

Use GenerationConfig.defaults() for the default configuration (256 max tokens, 0.8 temperature).


Streaming Generation

Pass a Consumer<String> to receive each token as it is generated. This is useful for displaying output in real time:

GenerationConfig config = GenerationConfig.builder()
        .maxTokens(512)
        .temperature(0.9f)
        .build();

String fullResponse = session.generate(
        "Write a haiku about Java",
        config,
        token -> System.out.print(token)  // stream tokens to stdout
);

System.out.println();  // newline after streaming

The generate overload with a Consumer<String> still returns the complete generated text as its return value.
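Because the streaming callback is a plain `java.util.function.Consumer<String>`, it composes like any other consumer. A self-contained sketch that counts tokens while forwarding them to another consumer (the wrapper is illustrative, not part of the SKaiNET API):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class TokenCounter {
    /** Wraps a token consumer so it also counts tokens as they stream by. */
    public static Consumer<String> counting(Consumer<String> downstream, AtomicInteger count) {
        return token -> {
            count.incrementAndGet();
            downstream.accept(token);
        };
    }
}
```

Passing `TokenCounter.counting(System.out::print, count)` as the third argument to `generate` would print tokens as before while tracking how many were produced.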


Async Generation

generateAsync offloads generation to a virtual thread and returns a CompletableFuture<String>:

import java.util.concurrent.CompletableFuture;

CompletableFuture<String> future = session.generateAsync(
        "Summarize the theory of relativity",
        GenerationConfig.builder().maxTokens(200).build()
);

// Do other work while generation runs...
String result = future.join();  // block when you need the result
System.out.println(result);

You can also compose futures:

session.generateAsync("Translate to French: Hello world")
       .thenAccept(translation -> System.out.println("Translation: " + translation))
       .exceptionally(ex -> { ex.printStackTrace(); return null; });
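The same composition machinery lets you fan out several prompts concurrently and wait for all of them. A self-contained sketch where `fakeGenerate` stands in for `session.generateAsync` (the stand-in and its echo output are illustrative, not SKaiNET behavior):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FanOutExample {
    // Stand-in for session.generateAsync: runs "generation" on a virtual thread
    static CompletableFuture<String> fakeGenerate(String prompt, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> "echo: " + prompt, pool);
    }

    /** Runs all prompts concurrently and returns results in prompt order. */
    public static List<String> generateAll(List<String> prompts) {
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            List<CompletableFuture<String>> futures =
                    prompts.stream().map(p -> fakeGenerate(p, pool)).toList();
            // allOf completes only once every generation has finished
            CompletableFuture.allOf(futures.toArray(CompletableFuture[]::new)).join();
            return futures.stream().map(CompletableFuture::join).toList();
        }
    }
}
```

Note that concurrent generations against a single session may serialize internally; the futures API only governs scheduling on the Java side.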

BERT Encoding and Similarity

All BERT Java classes live in sk.ainet.apps.bert.java.

Loading a BERT Model

Load a BERT model from a HuggingFace directory containing model.safetensors and vocab.txt:

import sk.ainet.apps.bert.java.KBertJava;
import sk.ainet.apps.bert.java.KBertSession;
import java.nio.file.Path;

try (KBertSession bert = KBertJava.loadSafeTensors(Path.of("./bert-base-uncased/"))) {
    // Encode text into an embedding vector
    float[] embedding = bert.encode("SKaiNET is a tensor framework");
    System.out.println("Embedding dimension: " + embedding.length);
}

The directory must contain:

  • model.safetensors -- BERT model weights
  • vocab.txt -- WordPiece vocabulary
  • config.json (optional) -- model config; defaults are used if absent

Similarity Scoring

Compute cosine similarity between two texts directly:

try (KBertSession bert = KBertJava.loadSafeTensors(Path.of("./bert-base-uncased/"))) {
    float score = bert.similarity(
            "The cat sat on the mat",
            "A kitten rested on the rug"
    );
    System.out.printf("Similarity: %.4f%n", score);  // e.g. 0.8923

    // Compare unrelated texts
    float low = bert.similarity(
            "The cat sat on the mat",
            "Stock prices rose sharply"
    );
    System.out.printf("Unrelated:  %.4f%n", low);    // e.g. 0.1247
}

The returned value is cosine similarity in the range [-1, 1].
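If you need to score many pairs, you can avoid re-encoding by caching the `float[]` vectors from `encode()` and computing cosine similarity yourself. A minimal helper (assuming equal-length, non-zero vectors):

```java
public class Cosine {
    /** Cosine similarity of two equal-length, non-zero vectors; result in [-1, 1]. */
    public static float similarity(float[] a, float[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("length mismatch");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }
}
```

This makes it cheap to, say, rank a corpus of pre-encoded documents against one query embedding with a single `encode()` call.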


Agent Loop and Tool Calling

All agent/tool classes live in sk.ainet.apps.kllama.chat.java.

The JavaAgentLoop lets the LLM call tools in a loop until it produces a final answer. You define tools by implementing the JavaTool interface.

Defining a Tool

import sk.ainet.apps.kllama.chat.java.JavaTool;
import sk.ainet.apps.kllama.chat.ToolDefinition;
import java.util.Map;

public class CalculatorTool implements JavaTool {

    @Override
    public ToolDefinition getDefinition() {
        return new ToolDefinition(
                "calculator",
                "Evaluate a mathematical expression",
                Map.of(
                    "expression", Map.of(
                        "type", "string",
                        "description", "The math expression to evaluate"
                    )
                )
        );
    }

    @Override
    public String execute(Map<String, Object> arguments) {
        String expr = (String) arguments.get("expression");
        // Your evaluation logic here
        double result = evaluate(expr);
        return String.valueOf(result);
    }

    private double evaluate(String expr) {
        // Simple evaluation implementation
        // ...
        return 0.0;
    }
}
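The `evaluate` method above is left as a stub. A minimal sketch that handles a single binary operation (a production tool would want a real expression parser with precedence and parentheses):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleEval {
    // One binary operation: "<number> <op> <number>", op in + - * /
    private static final Pattern BINARY =
            Pattern.compile("\\s*(-?\\d+(?:\\.\\d+)?)\\s*([+*/-])\\s*(-?\\d+(?:\\.\\d+)?)\\s*");

    /** Evaluates a single binary expression like "42 * 17"; illustrative only. */
    public static double evaluate(String expr) {
        Matcher m = BINARY.matcher(expr);
        if (!m.matches()) throw new IllegalArgumentException("unsupported expression: " + expr);
        double a = Double.parseDouble(m.group(1));
        double b = Double.parseDouble(m.group(3));
        return switch (m.group(2)) {
            case "+" -> a + b;
            case "-" -> a - b;
            case "*" -> a * b;
            case "/" -> a / b;
            default -> throw new IllegalStateException("unreachable");
        };
    }
}
```

Rejecting anything outside the expected grammar matters here: the expression string comes from the model, so the tool should treat it as untrusted input.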

Building and Using the Agent

import sk.ainet.apps.kllama.java.KLlamaJava;
import sk.ainet.apps.kllama.java.KLlamaSession;
import sk.ainet.apps.kllama.chat.java.JavaAgentLoop;
import java.nio.file.Path;

try (KLlamaSession session = KLlamaJava.loadGGUF(Path.of("model.gguf"))) {

    JavaAgentLoop agent = JavaAgentLoop.builder()
            .session(session)
            .tool(new CalculatorTool())
            .systemPrompt("You are a helpful assistant with access to a calculator.")
            .template("llama3")    // or "chatml"
            .build();

    // The agent will call the calculator tool if needed
    String answer = agent.chat("What is 42 * 17?");
    System.out.println(answer);

    // Multi-turn conversation -- context is preserved
    String followUp = agent.chat("Now divide that result by 3");
    System.out.println(followUp);

    // Reset conversation history (keeps system prompt)
    agent.reset();
}

Streaming Agent Responses

String answer = agent.chat(
        "What is the square root of 144?",
        token -> System.out.print(token)
);

Resource Management

Both KLlamaSession and KBertSession implement AutoCloseable. Always use try-with-resources to ensure off-heap memory arenas and other native resources are released promptly:

// Single session
try (KLlamaSession session = KLlamaJava.loadGGUF(path)) {
    session.generate("Hello");
}

// Multiple sessions
try (KLlamaSession llama = KLlamaJava.loadGGUF(llamaPath);
     KBertSession bert = KBertJava.loadSafeTensors(bertPath)) {

    String text = llama.generate("Write a summary of quantum mechanics");
    float[] embedding = bert.encode(text);
}

Failing to close sessions will leak off-heap memory allocated via java.lang.foreign.Arena.


Package Reference

Package                        | Key Classes
-------------------------------|---------------------------------------------
sk.ainet.apps.kllama.java      | KLlamaJava, KLlamaSession, GenerationConfig
sk.ainet.apps.bert.java        | KBertJava, KBertSession
sk.ainet.apps.kllama.chat.java | JavaAgentLoop, JavaTool

Next Steps