This guide covers loading and running large language models (LLaMA, BERT) from Java using SKaiNET's blocking, streaming, and async APIs.
- JDK 21+ with `--enable-preview --add-modules jdk.incubator.vector`
- See Java Getting Started for project setup
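If the project is built with Maven, these flags must reach both the compiler and any JVM that runs the code (including tests). One way to wire this up, sketched with the standard compiler and surefire plugins (plugin choice and placement are illustrative, not SKaiNET-specific):

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <release>21</release>
        <compilerArgs>
          <arg>--enable-preview</arg>
          <arg>--add-modules=jdk.incubator.vector</arg>
        </compilerArgs>
      </configuration>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <argLine>--enable-preview --add-modules jdk.incubator.vector</argLine>
      </configuration>
    </plugin>
  </plugins>
</build>
```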
```xml
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>sk.ainet</groupId>
      <artifactId>skainet-bom</artifactId>
      <version>0.13.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <!-- LLaMA inference -->
  <dependency>
    <groupId>sk.ainet</groupId>
    <artifactId>skainet-kllama-jvm</artifactId>
  </dependency>
  <!-- Agent / tool-calling (optional) -->
  <dependency>
    <groupId>sk.ainet</groupId>
    <artifactId>skainet-kllama-agent-jvm</artifactId>
  </dependency>
  <!-- BERT embeddings (optional) -->
  <dependency>
    <groupId>sk.ainet</groupId>
    <artifactId>skainet-bert-jvm</artifactId>
  </dependency>
  <!-- CPU backend -->
  <dependency>
    <groupId>sk.ainet</groupId>
    <artifactId>skainet-backend-cpu-jvm</artifactId>
  </dependency>
</dependencies>
```

All LLaMA Java classes live in `sk.ainet.apps.kllama.java`.
The simplest way to get started is to load a GGUF file. KLlamaJava.loadGGUF() handles context creation, weight loading, quantization dispatch, and tokenizer setup behind the scenes.
```java
import sk.ainet.apps.kllama.java.KLlamaJava;
import sk.ainet.apps.kllama.java.KLlamaSession;
import java.nio.file.Path;

public class LlamaExample {
    public static void main(String[] args) {
        try (KLlamaSession session = KLlamaJava.loadGGUF(Path.of("tinyllama-1.1b-q4.gguf"))) {
            String response = session.generate("The capital of France is");
            System.out.println(response);
        }
    }
}
```

`KLlamaSession` implements `AutoCloseable`, so try-with-resources properly releases the off-heap memory arenas when you are done.
If you have a HuggingFace model directory containing model.safetensors, config.json, and tokenizer.json:
```java
try (KLlamaSession session = KLlamaJava.loadSafeTensors(Path.of("./my-llama-model/"))) {
    String response = session.generate("Once upon a time");
    System.out.println(response);
}
```

The directory must contain:

- `model.safetensors` -- the model weights
- `config.json` -- model architecture config (hidden size, layers, heads, etc.)
- `tokenizer.json` -- HuggingFace tokenizer definition
Control generation parameters with the builder pattern:
```java
GenerationConfig config = GenerationConfig.builder()
    .maxTokens(256)      // maximum tokens to generate (default: 256)
    .temperature(0.7f)   // sampling temperature (default: 0.8)
    .build();

String response = session.generate("Explain quantum computing", config);
```

Use `GenerationConfig.defaults()` for the default configuration (256 max tokens, 0.8 temperature).
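`GenerationConfig` follows the standard Java builder idiom: an immutable config object constructed through a mutable builder. As a rough illustration of the pattern only (a hypothetical sketch, not SKaiNET's actual `GenerationConfig` source), such a class typically looks like:

```java
// Illustrative immutable config with a builder; the defaults here
// mirror those documented above (256 tokens, 0.8 temperature).
public final class GenConfig {
    private final int maxTokens;
    private final float temperature;

    private GenConfig(Builder b) {
        this.maxTokens = b.maxTokens;
        this.temperature = b.temperature;
    }

    public int maxTokens() { return maxTokens; }
    public float temperature() { return temperature; }

    public static Builder builder() { return new Builder(); }

    // Unset fields keep their defaults, so defaults() is just builder().build()
    public static GenConfig defaults() { return builder().build(); }

    public static final class Builder {
        private int maxTokens = 256;
        private float temperature = 0.8f;

        public Builder maxTokens(int n) { this.maxTokens = n; return this; }
        public Builder temperature(float t) { this.temperature = t; return this; }
        public GenConfig build() { return new GenConfig(this); }
    }
}
```

The builder keeps call sites readable as new sampling parameters are added, without a combinatorial explosion of constructors.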
Pass a Consumer<String> to receive each token as it is generated. This is useful for displaying output in real time:
```java
GenerationConfig config = GenerationConfig.builder()
    .maxTokens(512)
    .temperature(0.9f)
    .build();

String fullResponse = session.generate(
    "Write a haiku about Java",
    config,
    token -> System.out.print(token)  // stream tokens to stdout
);
System.out.println();  // newline after streaming
```

The `generate` overload with a `Consumer<String>` still returns the complete generated text as its return value.
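A single `Consumer<String>` can both display and accumulate tokens by chaining consumers with `andThen`. A self-contained sketch using a stub token stream in place of a session (no SKaiNET dependency):

```java
import java.util.List;
import java.util.function.Consumer;

public class TokenStreamDemo {
    // Drives a token consumer the way a streaming generate() call would,
    // then returns the full text, mirroring the overload described above.
    static String generateStub(List<String> tokens, Consumer<String> onToken) {
        StringBuilder full = new StringBuilder();
        for (String t : tokens) {
            onToken.accept(t);  // deliver each token as it is "generated"
            full.append(t);
        }
        return full.toString();
    }

    public static void main(String[] args) {
        StringBuilder captured = new StringBuilder();
        // Print each token as it arrives AND keep a copy for later use
        Consumer<String> onToken = ((Consumer<String>) System.out::print)
                .andThen(captured::append);

        String full = generateStub(List.of("Hello", ", ", "world"), onToken);
        System.out.println();
        assert full.equals(captured.toString());
    }
}
```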
generateAsync offloads generation to a virtual thread and returns a CompletableFuture<String>:
```java
import java.util.concurrent.CompletableFuture;

CompletableFuture<String> future = session.generateAsync(
    "Summarize the theory of relativity",
    GenerationConfig.builder().maxTokens(200).build()
);

// Do other work while generation runs...

String result = future.join();  // block when you need the result
System.out.println(result);
```

You can also compose futures:
```java
session.generateAsync("Translate to French: Hello world")
    .thenAccept(translation -> System.out.println("Translation: " + translation))
    .exceptionally(ex -> { ex.printStackTrace(); return null; });
```

All BERT Java classes live in `sk.ainet.apps.bert.java`.
Load a BERT model from a HuggingFace directory containing model.safetensors and vocab.txt:
```java
import sk.ainet.apps.bert.java.KBertJava;
import sk.ainet.apps.bert.java.KBertSession;
import java.nio.file.Path;

try (KBertSession bert = KBertJava.loadSafeTensors(Path.of("./bert-base-uncased/"))) {
    // Encode text into an embedding vector
    float[] embedding = bert.encode("SKaiNET is a tensor framework");
    System.out.println("Embedding dimension: " + embedding.length);
}
```

The directory must contain:

- `model.safetensors` -- BERT model weights
- `vocab.txt` -- WordPiece vocabulary
- `config.json` (optional) -- model config; defaults are used if absent
Compute cosine similarity between two texts directly:
```java
try (KBertSession bert = KBertJava.loadSafeTensors(Path.of("./bert-base-uncased/"))) {
    float score = bert.similarity(
        "The cat sat on the mat",
        "A kitten rested on the rug"
    );
    System.out.printf("Similarity: %.4f%n", score);  // e.g. 0.8923

    // Compare unrelated texts
    float low = bert.similarity(
        "The cat sat on the mat",
        "Stock prices rose sharply"
    );
    System.out.printf("Unrelated: %.4f%n", low);  // e.g. 0.1247
}
```

The returned value is cosine similarity in the range [-1, 1].
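Cosine similarity is also easy to compute yourself over raw `encode()` vectors, which is useful when you want to cache embeddings and compare them later without re-encoding. A self-contained sketch (plain Java, no SKaiNET dependency):

```java
public class CosineDemo {
    // Cosine similarity: dot(a, b) / (|a| * |b|), always in [-1, 1]
    static float cosine(float[] a, float[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("dimension mismatch");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }

    public static void main(String[] args) {
        float[] u = {1f, 0f, 0f};
        float[] v = {0f, 1f, 0f};
        System.out.println(cosine(u, u));  // identical vectors -> 1.0
        System.out.println(cosine(u, v));  // orthogonal vectors -> 0.0
    }
}
```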
All agent/tool classes live in sk.ainet.apps.kllama.chat.java.
The JavaAgentLoop lets the LLM call tools in a loop until it produces a final answer. You define tools by implementing the JavaTool interface.
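Conceptually, an agent loop parses the model's tool-call requests and dispatches each one by name to a registered tool, feeding the result back into the conversation until the model produces a final answer. A minimal, self-contained sketch of that dispatch step (illustrative only; it uses plain Java, not SKaiNET's classes):

```java
import java.util.Map;
import java.util.function.Function;

public class ToolDispatchDemo {
    // Registered tools, keyed by the name the model uses to call them
    static final Map<String, Function<Map<String, Object>, String>> TOOLS = Map.of(
        "calculator", args -> {
            // Toy evaluator: handles "a * b" only, for illustration
            String[] parts = ((String) args.get("expression")).split("\\*");
            double result = Double.parseDouble(parts[0].trim())
                          * Double.parseDouble(parts[1].trim());
            return String.valueOf(result);
        }
    );

    // One dispatch step: look up the named tool and execute it
    static String dispatch(String toolName, Map<String, Object> args) {
        Function<Map<String, Object>, String> tool = TOOLS.get(toolName);
        if (tool == null) throw new IllegalArgumentException("unknown tool: " + toolName);
        return tool.apply(args);
    }

    public static void main(String[] args) {
        String result = dispatch("calculator", Map.of("expression", "42 * 17"));
        System.out.println(result);  // 714.0
    }
}
```

The real loop additionally formats tool results back into the chat template and re-prompts the model; `JavaAgentLoop` handles all of that for you.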
```java
import sk.ainet.apps.kllama.chat.java.JavaTool;
import sk.ainet.apps.kllama.chat.ToolDefinition;
import java.util.Map;

public class CalculatorTool implements JavaTool {
    @Override
    public ToolDefinition getDefinition() {
        return new ToolDefinition(
            "calculator",
            "Evaluate a mathematical expression",
            Map.of(
                "expression", Map.of(
                    "type", "string",
                    "description", "The math expression to evaluate"
                )
            )
        );
    }

    @Override
    public String execute(Map<String, Object> arguments) {
        String expr = (String) arguments.get("expression");
        // Your evaluation logic here
        double result = evaluate(expr);
        return String.valueOf(result);
    }

    private double evaluate(String expr) {
        // Simple evaluation implementation
        // ...
        return 0.0;
    }
}
```

Register the tool with a `JavaAgentLoop` and chat:

```java
import sk.ainet.apps.kllama.java.KLlamaJava;
import sk.ainet.apps.kllama.java.KLlamaSession;
import sk.ainet.apps.kllama.chat.java.JavaAgentLoop;
import java.nio.file.Path;

try (KLlamaSession session = KLlamaJava.loadGGUF(Path.of("model.gguf"))) {
    JavaAgentLoop agent = JavaAgentLoop.builder()
        .session(session)
        .tool(new CalculatorTool())
        .systemPrompt("You are a helpful assistant with access to a calculator.")
        .template("llama3")  // or "chatml"
        .build();

    // The agent will call the calculator tool if needed
    String answer = agent.chat("What is 42 * 17?");
    System.out.println(answer);

    // Multi-turn conversation -- context is preserved
    String followUp = agent.chat("Now divide that result by 3");
    System.out.println(followUp);

    // Reset conversation history (keeps system prompt)
    agent.reset();
}
```

To stream tokens during an agent turn, pass a token consumer to `chat`:

```java
String answer = agent.chat(
    "What is the square root of 144?",
    token -> System.out.print(token)
);
```

Both `KLlamaSession` and `KBertSession` implement `AutoCloseable`. Always use try-with-resources to ensure off-heap memory arenas and other native resources are released promptly:
```java
// Single session
try (KLlamaSession session = KLlamaJava.loadGGUF(path)) {
    session.generate("Hello");
}

// Multiple sessions
try (KLlamaSession llama = KLlamaJava.loadGGUF(llamaPath);
     KBertSession bert = KBertJava.loadSafeTensors(bertPath)) {
    String text = llama.generate("Write a summary of quantum mechanics");
    float[] embedding = bert.encode(text);
}
```

Failing to close sessions will leak off-heap memory allocated via `java.lang.foreign.Arena`.
| Package | Key Classes |
|---|---|
| `sk.ainet.apps.kllama.java` | `KLlamaJava`, `KLlamaSession`, `GenerationConfig` |
| `sk.ainet.apps.bert.java` | `KBertJava`, `KBertSession` |
| `sk.ainet.apps.kllama.chat.java` | `JavaAgentLoop`, `JavaTool` |
- Java Getting Started -- tensor operations, project setup, and dependency management.
- Model Training Guide -- build and train neural networks from Java.