
This model was released on {release_date} and added to Hugging Face Transformers on 2026-05-04.

Gemma 4 Assistant

Overview

Gemma 4 Assistant is a small, text-only model that enables speculative decoding for Gemma 4 models using the Multi-Token Prediction (MTP) method and its associated candidate generator. Pre-trained models are provided for the IT variants of the Gemma 4 E2B, E4B, 31B and 26B-A4B (MoE) models.

Architecturally, the Gemma 4 Assistant shares the same Gemma4TextModel backbone as other Gemma 4 models, but differs in a few key ways:

  • The entire model uses KV sharing. This technique, originally introduced with Gemma 3n, allows the model to reuse the KV cache populated by the target model the assistant supports. The assistant can therefore skip the prefill phase entirely, which considerably reduces attention compute during the forward pass.
  • The position_ids values are constant. Since the KV cache is shared and the assistant has no means of updating it, the assistant predicts all tokens from the same position ID.
  • Inputs are the concatenation of embeddings and hidden states. To adapt to the static KV cache and position_ids, the model takes as input the concatenation of the embedding and hidden states for the last seen token from the target model and projects it into the assistant model's space with an nn.Linear transform (see the sketch after this list). The definition of the last seen token changes throughout the assisted decoding loop: for the first token drafted after prefill, it is the last token of the prompt; for subsequent drafting steps, it is the last token generated by the assistant (within a drafting round) or the last token accepted by the target model (between drafting rounds).
  • Cross-attention is used to make the most of the target model’s context. Cross-attention allows the query states generated by the assistant to attend to the shared KV cache values from the target model, allowing the assistant to accurately predict more drafted tokens per drafting round.

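The hedged sketch below ties these pieces together. It is a minimal illustration, not the library implementation: the assistant hidden size, attention geometry, and single-projection layout are assumptions chosen for readability (backbone_hidden_size matches the config default of 1536).

import torch
import torch.nn as nn

# Illustrative sizes only; the real values come from Gemma4AssistantConfig.
hidden_size = 512            # assistant hidden size (assumed)
backbone_hidden_size = 1536  # target-model hidden size (config default)
num_heads, head_dim, kv_len = 4, 64, 32  # assumed attention geometry

# Project the concatenated [embedding; target hidden state] of the last seen
# token into the assistant's hidden space with a single nn.Linear.
input_proj = nn.Linear(hidden_size + backbone_hidden_size, hidden_size)
token_embed = torch.randn(1, 1, hidden_size)             # embedding of the last seen token
target_hidden = torch.randn(1, 1, backbone_hidden_size)  # target model's last hidden state
assistant_input = input_proj(torch.cat([token_embed, target_hidden], dim=-1))

# position_ids are constant: every drafted token reuses the same position.
position_ids = torch.zeros(1, 1, dtype=torch.long)

# Cross-attention: query states from the assistant attend to key/value states
# that the target model populated during its prefill.
q = torch.randn(1, num_heads, 1, head_dim)              # query derived from assistant_input
shared_k = torch.randn(1, num_heads, kv_len, head_dim)  # keys from the target's KV cache
shared_v = torch.randn(1, num_heads, kv_len, head_dim)  # values from the target's KV cache
attn_out = torch.nn.functional.scaled_dot_product_attention(q, shared_k, shared_v)
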
You can find all the original Gemma 4 Assistant checkpoints under the Gemma 4 release.

Usage examples

The example below demonstrates how to generate text based on an image with Pipeline or the AutoModel class.

<hfoptions id="usage"> <hfoption id="Pipeline">
import torch
from transformers import pipeline

pipeline = pipeline(
    task="image-text-to-text",
    model="google/gemma-4-E2B-it",
    assistant_model="google/gemma-4-E2B-it-assistant",
    dtype=torch.bfloat16,
)
pipeline(
    images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
    text="<|image|>\n\nWhat is shown in this image?"
)
</hfoption> <hfoption id="AutoModel">
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, AutoModelForCausalLM

model = AutoModelForImageTextToText.from_pretrained(
    "google/gemma-4-E2B-it",
    dtype=torch.bfloat16,
    device_map="auto",
)
assistant_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E2B-it-assistant",
    dtype=torch.bfloat16,
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(
    "google/gemma-4-E2B-it",
    padding_side="left"
)
messages = [
    {
        "role": "user", "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

output = model.generate(**inputs, max_new_tokens=50, assistant_model=assistant_model)
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
</hfoption> </hfoptions>

Gemma4AssistantConfig

class transformers.Gemma4AssistantConfig


( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None text_config: transformers.models.gemma4.configuration_gemma4.Gemma4TextConfig | dict[str, typing.Any] | None = None backbone_hidden_size: int = 1536 use_ordered_embeddings: bool = False num_centroids: int = 2048 centroid_intermediate_top_k: int = 32 tie_word_embeddings: bool = True )

Parameters

  • text_config (Union[~models.gemma4.configuration_gemma4.Gemma4TextConfig, dict[str, Any]], optional) — The config object or dictionary of the text backbone.
  • backbone_hidden_size (int, defaults to 1536) — Hidden size of the target model this assistant was trained with.
  • use_ordered_embeddings (bool, defaults to False) — If True, uses an embedding table ordered for optimal assistant model performance that needs to be re-ordered to align with the main model.
  • num_centroids (int, defaults to 2048) — The total number of centroids.
  • centroid_intermediate_top_k (int, defaults to 32) — The number of active centroids.
  • tie_word_embeddings (bool, optional, defaults to True) — Whether to tie weight embeddings according to model’s tied_weights_keys mapping.

This is the configuration class to store the configuration of a Gemma4AssistantModel. It is used to instantiate a Gemma4 Assistant model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of google/gemma-4-e2b-it-assistant.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Example:

>>> from transformers import (
...     Gemma4AssistantConfig,
...     Gemma4AssistantForCausalLM,
...     Gemma4TextConfig,
... )

>>> # Initializing a Gemma 4 Text config similar to google/gemma-4-e2b-it-assistant.
>>> text_config = Gemma4TextConfig(...)

>>> # Initializing a Gemma 4 Assistant config similar to google/gemma-4-e2b-it-assistant.
>>> configuration = Gemma4AssistantConfig(text_config=text_config)

>>> # Initializing a model from the google/gemma-4-e2b-it-assistant configuration.
>>> model = Gemma4AssistantForCausalLM(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

Gemma4AssistantForCausalLM

class transformers.Gemma4AssistantForCausalLM


( config: Gemma4AssistantConfig )

Parameters

  • config (Gemma4AssistantConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

A model for multi-token prediction-based assisted decoding with Gemma 4.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

create_attention_masks


( inputs_embeds attention_mask shared_kv_states )

Prepare the attention masks for the assistant model; shared_kv_states acts as the past cache in this instance.

We use bidirectional masks to account for causality:

  • There is no difference for the edge case of q_len == 1, as it acts as full attention no matter what.
  • SWA (sliding-window attention) interprets the window as forward-looking (future) when q_idx = 1 and kv >= 1:
    • We switch from a future to a past perspective by flipping on the kv axis (see the sketch below).
    • To account for position-invariant padding, we also flip the base attention mask before the initial creation.
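
As a rough illustration of the kv-axis flip, the hedged sketch below (toy shapes, not the library implementation) turns a forward-looking sliding window into a past-looking one:

import torch

# Toy sizes, assumed for illustration only.
q_len, kv_len, window = 4, 12, 8
q_idx = torch.arange(q_len).unsqueeze(-1)   # (q_len, 1)
kv_idx = torch.arange(kv_len).unsqueeze(0)  # (1, kv_len)

# Forward-looking window: each query may attend to up to `window` future keys.
forward_mask = (kv_idx >= q_idx) & (kv_idx < q_idx + window)

# Flip along the kv axis to switch from a future to a past perspective.
past_mask = forward_mask.flip(dims=[-1])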

forward


( input_ids: torch.Tensor | None = None inputs_embeds: torch.Tensor | None = None position_ids: torch.LongTensor | None = None attention_mask: dict[str, torch.Tensor] | None = None shared_kv_states: dict[str, tuple[torch.Tensor, torch.Tensor]] | None = None use_cache: bool | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] )

Parameters

  • input_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • inputs_embeds (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1].

    What are position IDs?

  • attention_mask (dict[str, torch.Tensor] of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • shared_kv_states (dict[str, tuple[torch.Tensor, torch.Tensor]], optional) — A dictionary containing the computed KV values for the last layer of each layer_type in this model.
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

The Gemma4AssistantForCausalLM forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, Gemma4AssistantForCausalLM, Gemma4ForCausalLM

>>> model = Gemma4ForCausalLM.from_pretrained("google/gemma-4-e2b-it")
>>> assistant_model = Gemma4AssistantForCausalLM.from_pretrained("google/gemma-4-e2b-it-assistant")
>>> tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-e2b-it")

>>> prompt = "What is your favorite condiment?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, assistant_model=assistant_model, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0]
"What is your favorite condiment?"