After showing impressive efficiency with Gemma 3, running powerful AI on a single GPU, Google has pushed the boundaries even further with Gemma 3n. This new release brings state-of-the-art AI to mobile and edge devices, using minimal memory while delivering fast, multimodal performance. In this article, we’ll explore what makes Gemma 3n so powerful, how it works under the hood with innovations like Per-Layer Embeddings (PLE) and MatFormer architecture, and how to access Gemma 3n easily using Google AI Studio. If you’re a developer looking to build fast, smart, and lightweight AI apps, this is your starting point.
What is Gemma 3n?
Gemma 3 showed us that powerful AI models can run efficiently, even on a single GPU, while outperforming larger models like DeepSeek V3 in chatbot Elo scores with significantly less compute. Now, Google has taken things further with Gemma 3n, designed to bring state-of-the-art performance to even smaller, on-device environments like mobile phones and edge devices.
To make this possible, Google partnered with hardware leaders like Qualcomm, MediaTek, and Samsung System LSI, introducing a new on-device AI architecture that powers fast, private, and multimodal AI experiences. The “n” in Gemma 3n stands for nano, reflecting its small size yet powerful capabilities.
This new architecture is built on two key innovations:
- Per-Layer Embeddings (PLE): Developed by Google DeepMind, PLE reduces memory usage by caching and managing layer-specific data outside the model’s main memory. It enables larger models (5B and 8B parameters) to run with just 2GB to 3GB of RAM, similar to 2B and 4B models.
- MatFormer (Matryoshka Transformer): A nested model architecture that allows smaller sub-models to function independently within a larger model. This gives developers flexibility to choose performance or speed without switching models or increasing memory usage.
Together, these innovations make it possible for Gemma 3n to run high-performance, multimodal AI efficiently on low-resource devices.
How Does PLE Increase Gemma 3n’s Performance?
When Gemma 3n models run, Per-Layer Embedding (PLE) parameters are used to generate data that improves the performance of each model layer. As each layer executes, its PLE data can be computed independently, outside the model’s working memory, cached to fast storage, and then incorporated into the inference process. By keeping PLE parameters out of the model’s main memory space, this approach lowers resource usage without sacrificing the quality of the model’s responses.
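As a rough mental model of that flow, here is a minimal, self-contained Python sketch using NumPy and a temporary on-disk cache as stand-ins. It is a simplification for illustration only, not Gemma’s actual runtime: per-layer embedding data is kept out of the model’s working memory, cached to storage, and loaded only when its layer runs.

import os
import tempfile
import numpy as np

cache_dir = tempfile.mkdtemp()          # stand-in for fast local storage
num_layers, d_model = 4, 64             # toy sizes, not Gemma's real dimensions
rng = np.random.default_rng(0)

# Precompute and cache per-layer embedding data outside "model memory"
for layer in range(num_layers):
    ple = rng.standard_normal(d_model).astype(np.float32)
    np.save(os.path.join(cache_dir, f"ple_layer_{layer}.npy"), ple)

def run_layer(x, layer):
    """Stand-in for a transformer layer: load its cached PLE data, apply it, discard it."""
    ple = np.load(os.path.join(cache_dir, f"ple_layer_{layer}.npy"))
    return x + ple  # placeholder for how PLE data conditions the layer's computation

x = np.zeros(d_model, dtype=np.float32)
for layer in range(num_layers):
    x = run_layer(x, layer)
print(x.shape)

The key point is that only one layer’s PLE data needs to be resident at a time; everything else stays in cheap storage rather than accelerator memory.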
Gemma 3n models are labeled with parameter counts like E2B and E4B, which refer to their Effective parameter usage, a value lower than their total number of parameters. The “E” prefix signifies that these models can operate using a reduced set of parameters, thanks to the flexible parameter technology embedded in Gemma 3n, allowing them to run more efficiently on lower-resource devices.
These models organize their parameters into four key categories: text, visual, audio, and per-layer embedding (PLE) parameters. For instance, while the E2B model normally loads over 5 billion parameters during standard execution, it can reduce its active memory footprint to just 1.91 billion parameters by using parameter skipping and PLE caching, as shown in the following image:
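To get a feel for what “effective parameters” means, here is a rough, illustrative calculation. The per-category sizes below are hypothetical placeholders invented for the example; only the roughly 5-billion total and 1.91-billion effective figures correspond to the numbers mentioned above.

# Illustrative only: the per-category split is hypothetical, chosen so the total
# comes out a little over 5B; ~1.91B is the published effective figure for E2B.
params_billion = {
    "text": 1.91,    # kept resident for a text-only request
    "visual": 0.30,  # can be skipped when there is no image input (hypothetical size)
    "audio": 0.68,   # can be skipped when there is no audio input (hypothetical size)
    "ple": 2.55,     # cached outside accelerator memory via PLE (hypothetical size)
}

total = sum(params_billion.values())
effective = params_billion["text"]  # what must actually sit in memory for text-only inference
print(f"total: {total:.2f}B parameters, effective: {effective:.2f}B")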
Key Features of Gemma 3n
Gemma 3n is fine-tuned for on-device tasks:
- Function calling on the device: the model can use user input to initiate or call specific operations directly on the device, such as launching apps, setting reminders, or turning on the flashlight. This lets the AI do more than just respond; it can interact with the device itself.
- Interleaved text and image input: Gemma 3n can understand and respond to prompts that combine text and images. For instance, if you upload an image and ask a text question about it, the model can handle both.
- Audio and video understanding: for the first time in the Gemma family, the model can comprehend both audio and visual inputs, which earlier Gemma models did not support. Gemma 3n can watch videos and listen to sound to understand what is happening, such as recognizing actions, detecting speech, or answering questions based on a video clip.
These capabilities let the model interact with its environment and let users interact with applications more naturally. On mobile, Gemma 3n is about 1.5 times faster than Gemma 3 4B, which reduces generation latency and makes the user experience noticeably more fluid.
Thanks to its unique 2-in-1 MatFormer architecture, Gemma 3n also contains a smaller sub-model nested inside it. This lets users dynamically choose between quality and speed as needed, without managing a separate model, and all within the same memory footprint.
How Does the MatFormer Architecture Help?
Gemma 3n models use the Matryoshka Transformer, or MatFormer, architecture, in which smaller models are nested inside a bigger one. When responding to queries, inference can run on one of the nested sub-models without activating the enclosing model’s parameters. Running only the smaller, core model inside a MatFormer lowers the model’s energy footprint, response time, and compute cost. In Gemma 3n, the E2B model’s parameters are included within the E4B model. The architecture also lets you choose configurations and assemble models at sizes between 2B and 4B.
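To make the nesting idea more tangible, here is a tiny, self-contained NumPy sketch of a Matryoshka-style feed-forward block. It is purely illustrative: the layer sizes are made up and Gemma’s real implementation is far more involved; the point is simply that a sub-model’s forward pass reuses a prefix slice of the larger model’s weights, so nothing extra has to be loaded.

import numpy as np

d_model, d_ff_large, d_ff_small = 64, 256, 128   # hypothetical sizes

rng = np.random.default_rng(0)
W_in = rng.standard_normal((d_model, d_ff_large)) * 0.02
W_out = rng.standard_normal((d_ff_large, d_model)) * 0.02

def ffn(x, d_ff):
    """Run the feed-forward block using only the first d_ff hidden units."""
    h = np.maximum(x @ W_in[:, :d_ff], 0.0)       # ReLU over the sliced hidden layer
    return h @ W_out[:d_ff, :]

x = rng.standard_normal(d_model)
y_full = ffn(x, d_ff_large)    # "E4B-like" path: all hidden units
y_small = ffn(x, d_ff_small)   # "E2B-like" path: nested sub-model, same weights in memory
print(y_full.shape, y_small.shape)

Because the smaller path is a slice of the same weight matrices, switching between “fast” and “full” behaviour is a runtime choice rather than a separate model download.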
How to Access Gemma 3n?
The Gemma 3n preview is available through Google AI Studio, the Google GenAI SDK, and MediaPipe (with weights on Hugging Face and Kaggle). We will access Gemma 3n using Google AI Studio.

- Step 1: Log in to Google AI Studio
- Step 2: Click on Get API key

- Step 3: Click on Create API key

- Step 4: Select a project of your choice and click on Create API Key

- Step 5: Copy the API key and save it; you will use it later to access Gemma 3n.
- Step 6: Now that we have the API key, let’s spin up a Colab instance. Go to colab.new in the browser to create a new notebook.
- Step 7: Install dependencies
!pip install google-genai
- Step 8: Use Colab’s secret keys to store GEMINI_API_KEY, and enable notebook access for it as well.

- Step 9: Use the below code to set environment variables:
from google.colab import userdata
import os
os.environ["GEMINI_API_KEY"] = userdata.get('GEMINI_API_KEY')
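Optionally, before calling the API, you can confirm the key was picked up. This is just a convenience check, not something the SDK requires; it reuses the os import from the step above.

# Optional sanity check: fail early if the secret was not configured in Colab
assert os.environ.get("GEMINI_API_KEY"), "GEMINI_API_KEY not set - check the Secrets panel in Colab"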
- Step 10: Run the code below to run inference with Gemma 3n:
import base64
import os

from google import genai
from google.genai import types


def generate():
    # Create a client authenticated with the API key stored in the environment
    client = genai.Client(
        api_key=os.environ.get("GEMINI_API_KEY"),
    )

    model = "gemma-3n-e4b-it"

    # Build the user prompt as a single text part
    contents = [
        types.Content(
            role="user",
            parts=[
                types.Part.from_text(text="""Anu is a girl. She has three brothers. Each of her brothers has the same two sisters. How many sisters does Anu have?"""),
            ],
        ),
    ]
    generate_content_config = types.GenerateContentConfig(
        response_mime_type="text/plain",
    )

    # Stream the response and print each chunk as it arrives
    for chunk in client.models.generate_content_stream(
        model=model,
        contents=contents,
        config=generate_content_config,
    ):
        print(chunk.text, end="")


if __name__ == "__main__":
    generate()
Output:

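If you don’t need token-by-token streaming, the same SDK also supports a single blocking call. The snippet below is a minimal variation of the code above (same model name and API key; the SDK also accepts a plain string as contents):

import os
from google import genai

client = genai.Client(api_key=os.environ.get("GEMINI_API_KEY"))

# Single non-streaming request: the full reply is returned in one response object
response = client.models.generate_content(
    model="gemma-3n-e4b-it",
    contents="Anu is a girl. She has three brothers. Each of her brothers has the same two sisters. How many sisters does Anu have?",
)
print(response.text)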
Conclusion
Gemma 3n is a big leap for AI on small devices. It runs powerful models with less memory and at faster speeds. Thanks to PLE and MatFormer, it’s both efficient and smart. It works with text, images, audio, and even video, all on-device. Google has made it easy for developers to test and use Gemma 3n through Google AI Studio. If you’re building mobile or edge AI apps, Gemma 3n is definitely worth exploring. Check out Google AI Edge to run Gemma 3n locally.