Qwen just released 8 new models as part of its latest family, Qwen3, showcasing promising capabilities. The flagship model, Qwen3-235B-A22B, outperformed most other models, including DeepSeek-R1, OpenAI’s o1 and o3-mini, Grok 3, and Gemini 2.5 Pro, on standard benchmarks. Meanwhile, the small Qwen3-30B-A3B outperformed QwQ-32B, which has roughly 10 times as many activated parameters. With such advanced capabilities, these models are a great choice for a wide range of applications. In this article, we will explore the features of all the Qwen3 models and learn how to use them to build RAG systems and AI agents.
What is Qwen3?
Qwen3 is the latest series of large language models (LLMs) in the Qwen family, consisting of 8 different models. These include Qwen3-235B-A22B, Qwen3-30B-A3B, Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B. All these models are released under Apache 2.0 license, making them freely available to individuals, developers, and enterprises.
While 6 of these models are dense, meaning they actively use all their parameters during inference and training, 2 of them are Mixture-of-Experts (MoE) models:
- Qwen3-235B-A22B: A large MoE model with 235 billion total parameters, of which 22 billion are activated during inference.
- Qwen3-30B-A3B: A smaller MoE model with 30 billion total parameters and 3 billion activated parameters.
Here’s a detailed comparison of all the 8 Qwen3 models:
| Models | Layers | Heads (Q/KV) | Tie Embedding | Context Length |
|---|---|---|---|---|
| Qwen3-0.6B | 28 | 16/8 | Yes | 32K |
| Qwen3-1.7B | 28 | 16/8 | Yes | 32K |
| Qwen3-4B | 36 | 32/8 | Yes | 32K |
| Qwen3-8B | 36 | 32/8 | No | 128K |
| Qwen3-14B | 40 | 40/8 | No | 128K |
| Qwen3-32B | 64 | 64/8 | No | 128K |
| Qwen3-30B-A3B | 48 | 32/4 | No | 128K |
| Qwen3-235B-A22B | 94 | 64/4 | No | 128K |
Here’s what the table tells us (a quick way to verify these numbers is shown after this list):
- Layers: The number of transformer blocks stacked sequentially. Each block includes a multi-head self-attention mechanism, a feed-forward network, layer normalization, and residual connections. So, when we say Qwen3-30B-A3B has 48 layers, it means the model stacks 48 such transformer blocks one after another.
- Heads (Q/KV): Transformers use multi-head attention, which splits the attention mechanism into several heads, each learning a different aspect of the data. Here, Q/KV represents:
- Q (Query heads): The total number of attention heads used for generating queries.
- KV (Key/Value heads): The number of key/value heads per attention block. Fewer KV heads than Q heads (e.g. 16/8) indicates grouped-query attention, where groups of query heads share the same key/value heads to save memory.
Note: The head counts describe how the attention mechanism is partitioned; they are distinct from the query, key, and value vectors that self-attention computes for each token.
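To make the table concrete, here is a minimal sketch that reads these numbers straight from the model config. It assumes the Hugging Face model id Qwen/Qwen3-0.6B and the standard transformers config field names:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")
print("Layers:", config.num_hidden_layers)          # expected: 28
print("Query heads:", config.num_attention_heads)   # expected: 16
print("KV heads:", config.num_key_value_heads)      # expected: 8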
Also Read: Qwen3 Models: How to Access, Performance, Features, and Applications
Key Features of Qwen3
Here are some of the key features of the Qwen3 models:
- Pre-training: The pre-training process consists of three stages:
- In the first stage, the model was pretrained on over 30 trillion tokens with a context length of 4k tokens. This taught the model basic language skills and general knowledge.
- In the second stage, data quality was improved by increasing the proportion of knowledge-intensive data such as STEM, coding, and reasoning content. The model was then trained on an additional 5 trillion tokens.
- In the final stage, high-quality long-context data was used and the context length was extended to 32K tokens, ensuring the model can handle longer inputs effectively.
- Post-training: To develop a hybrid model capable of both step-by-step reasoning and rapid responses, a 4-stage post-training pipeline was implemented, consisting of a long chain-of-thought cold start, reasoning-based reinforcement learning, thinking-mode fusion, and general reinforcement learning.
- Hybrid Thinking Modes: Qwen3 models use a hybrid approach to problem solving, featuring two modes (a runnable sketch follows this list):
- Thinking Mode: The model reasons step by step, breaking a complex problem into smaller procedural steps before answering.
- Non-Thinking Mode: The model responds quickly, which is mostly suitable for simpler questions.
- Multilingual Support: Qwen3 models support 119 languages and dialects. This helps users from all around the world to benefit from these models.
- Improved Agentic Capabilities: Qwen has optimized the Qwen3 models for better coding and agentic capabilities, supporting the Model Context Protocol (MCP) as well.
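To see the hybrid thinking modes in action, here is a minimal local-inference sketch based on the Qwen3 model cards. It assumes the Hugging Face model id Qwen/Qwen3-0.6B and that your transformers version passes the enable_thinking flag through to the chat template:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-0.6B"  # assumed HF id for the smallest Qwen3 model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "What is 17 * 24?"}]

# enable_thinking=True makes the model reason step by step before answering;
# set it to False for quick, direct answers (non-thinking mode).
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Qwen also documents soft switches: appending /think or /no_think to a user message toggles the mode turn by turn.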
How to Access Qwen3 Models via API
To use the Qwen3 models, we will access them via the OpenRouter API. Here’s how to do it:
- Create an account on OpenRouter and use the model search bar to find the Qwen3 model you want to use.

- Select the model of your choice and click ‘Create API key’ on the landing page to generate a new API key.
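Once you have a key, you can sanity-check it with a minimal request through any OpenAI-compatible client before wiring the model into Langchain. This sketch assumes the openai Python SDK (v1+) is installed and uses the same free Qwen3 endpoint slug that appears later in this article:

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="your_api_key",  # paste the key generated above
)

response = client.chat.completions.create(
    model="qwen/qwen3-235b-a22b:free",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)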

Using Qwen3 to Power Your AI Solutions
In this section, we’ll go through the process of building AI applications using Qwen3. We will first create an AI-powered travel planner agent using the model, and then a Q/A RAG bot using Langchain.
Prerequisites
Before building real-world AI solutions with Qwen3, we first need the basic prerequisites in place: a working Python environment, an OpenRouter API key (generated in the previous section), and the libraries installed at the start of each build below.
Building an AI Agent using Qwen3
In this section, we’ll use Qwen3 to create an AI-powered travel agent that suggests the major tourist spots for the city or place you are visiting. We will also enable the agent to search the internet for up-to-date information, and add a tool for currency conversion.
Step 1: Setting up Libraries and Tools
First, we will be installing and importing the necessary libraries and tools required to build the agent.
!pip install langchain langchain-community openai duckduckgo-search

from langchain_community.chat_models import ChatOpenAI
from langchain.agents import Tool, initialize_agent, AgentType
from langchain_community.tools import DuckDuckGoSearchRun
llm = ChatOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="your_api_key",
    model="qwen/qwen3-235b-a22b:free"
)
# Web search tool
search = DuckDuckGoSearchRun()

# Tool for destination recommendations
def get_destinations(destination):
    return search.run(f"Top 3 tourist spots in {destination}")

DestinationTool = Tool(
    name="Destination Recommender",
    func=get_destinations,
    description="Finds top places to visit in a city"
)

# Tool for currency conversion
def convert_usd_to_inr(query):
    # Pull the first numeric value out of the query string
    amount = [float(s) for s in query.split() if s.replace('.', '', 1).isdigit()]
    if amount:
        return f"{amount[0]} USD = {amount[0] * 83.2:.2f} INR"  # static USD-to-INR rate
    return "Couldn't parse amount."

CurrencyTool = Tool(
    name="Currency Converter",
    func=convert_usd_to_inr,
    description="Converts USD to INR based on a static rate"
)
- search: DuckDuckGoSearchRun() lets the agent run web searches to get real-time information about popular tourist spots.
- DestinationTool: Wraps the get_destinations() function, which uses the search tool to find the top 3 tourist spots in any given city.
- CurrencyTool: Wraps the convert_usd_to_inr() function to convert prices from USD to INR at a static rate. You can change ‘INR’ in the function to convert to a currency of your choice.
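Before handing these tools to the agent, you can optionally call them directly to confirm they behave as expected (the city and amount below are just illustrative):

print(get_destinations("Paris"))
print(convert_usd_to_inr("Convert 100 USD"))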
Also Read: Build a Travel Assistant Chatbot with HuggingFace, LangChain, and MistralAI
Step 2: Creating the Agent
Now that we have initialized all the tools, let’s proceed to creating an agent that will use the tools and give us a plan for the trip.
tools = [DestinationTool, CurrencyTool]

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,  # the parameter is `agent`, not `agent_type`
    verbose=True
)
def trip_planner(city, usd_budget):
    dest = get_destinations(city)
    inr_budget = convert_usd_to_inr(f"{usd_budget} USD to INR")
    return f"""Here is your travel plan:
*Top spots in {city}*:
{dest}
*Budget*:
{inr_budget}
Enjoy your day trip!"""
- initialize_agent: This function creates a Langchain agent using the zero-shot ReAct approach, in which the agent relies on the tool descriptions to decide what to do.
- agent: AgentType.ZERO_SHOT_REACT_DESCRIPTION lets the LLM decide which tool to use in a given situation without prior examples, based only on each tool’s description and the input.
- verbose: Enables logging of the agent’s thought process, so we can monitor each decision the agent makes, including all the interactions and tools invoked.
- trip_planner: This is a Python function that calls the tool functions directly instead of routing through the agent. It is useful when you already know which tool fits each part of the task.
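For example, the manual pipeline can be called directly, bypassing the agent’s reasoning loop (the city and budget are illustrative):

print(trip_planner("Jaipur", 500))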
Step 3: Running the Agent
In this section, we’ll run the agent and observe its response.
# Set the trip parameters
city = "Delhi"
usd_budget = 8500

# Run the agent
response = agent.run(f"Plan a day trip to {city} with a budget of {usd_budget} USD")
from IPython.display import Markdown, display
display(Markdown(response))
- Agent invocation: agent.run() takes the user’s intent from the prompt, lets the agent pick and call the tools, and returns the final trip plan.
Output

Building a RAG System using Qwen3
In this section, we’ll create a RAG bot that answers queries using the relevant document from our knowledge base, generating informative responses with qwen/qwen3-235b-a22b. The system also uses Langchain to produce accurate, context-aware responses.
Step 1: Setting up the Libraries and Tools
First, we will be installing and importing the necessary libraries and tools required to build the RAG system.
!pip install langchain langchain-community langchain-core openai tiktoken chromadb sentence-transformers duckduckgo-search

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.chat_models import ChatOpenAI
# Load your document
loader = TextLoader("/content/my_docs.txt")
docs = loader.load()
- Loading Documents: Langchain’s TextLoader class loads plain-text files like my_docs.txt for Q/A retrieval; for PDFs or Word documents you would use a dedicated loader such as PyPDFLoader instead. Here I’ve uploaded my_docs.txt.
- Selecting the Vector Store: We use ChromaDB to store and search the embeddings in our vector database during the Q/A process.
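If you’re following along in Colab and don’t have a document handy, you can create a small placeholder file first; the content below is purely illustrative:

sample_text = """Qwen3 is the latest family of Qwen large language models.
It supports a hybrid thinking mode, 119 languages, and MoE variants."""
with open("/content/my_docs.txt", "w") as f:
    f.write(sample_text)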
Step 2: Creating the Embeddings
Now that we’ve loaded our document, let’s proceed to creating embeddings out of it which will help in easing the retrieval process.
# Split into chunks
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Embed with a HuggingFace sentence-transformers model
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma.from_documents(chunks, embedding=embeddings)

# Set up the Qwen LLM from OpenRouter
llm = ChatOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_API_KEY",
    model="qwen/qwen3-235b-a22b:free"
)

# Create the RAG chain: retrieve the top 2 chunks for each query
retriever = db.as_retriever(search_kwargs={"k": 2})
rag_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
- Document Splitting: CharacterTextSplitter() splits the text into smaller chunks, which helps in two ways: it eases the retrieval process, and chunk_overlap retains context from the previous chunk.
- Embedding Documents: The embedding model converts each chunk of text into a fixed-size vector. Note that chunk_size=300 controls the number of characters per chunk, not the vector size; all-MiniLM-L6-v2 maps each chunk to a 384-dimensional vector that captures its contextual meaning.
- RAG Chain: The RAG chain combines the ChromaDB retriever with the LLM. This lets us get contextually aware answers grounded in the document as well as in the model’s own knowledge.
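Before invoking the full chain, it can help to inspect what the retriever actually returns for a sample query. A quick sketch, using the classic retriever call from this Langchain version:

docs_found = retriever.get_relevant_documents("Qwen with MCP")
for d in docs_found:
    print(d.page_content[:200], "\n---")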
Step 3: Initializing the RAG System
# Ask a question
response = rag_chain.invoke({"query": "How can I use Qwen with MCP? Please give me a stepwise guide along with the necessary code snippets."})

display(Markdown(response['result']))
- Query Execution: The rag_chain.invoke() method sends the user’s query to the RAG system, which retrieves the relevant chunks from the document store (vector DB) and generates a context-aware answer.
Output



You can find the complete code here.
Applications of Qwen3
Here are some more applications of Qwen3 across industries:
- Automated Coding: Qwen3 can generate, debug, and document code, helping developers resolve errors with less manual effort. Its flagship MoE model (235B total, 22B activated parameters) excels at coding, with performance comparable to models like DeepSeek-R1, Gemini 2.5 Pro, and OpenAI’s o3-mini.
- Education and Research: Qwen3 achieves high accuracy in math, physics, and logical-reasoning problems. It rivals Gemini 2.5 Pro and outperforms models such as OpenAI’s o1, o3-mini, DeepSeek-R1, and Grok 3 Beta.
- Agent-Based Tool Integration: Qwen3 also leads in AI agent tasks by allowing the use of external tools, APIs, and MCPs for multi-step and multi-agentic workflows with its tool-calling template, which further simplifies the agentic interaction.
- Advanced Reasoning Tasks: Qwen3 uses an extensive thinking capability to deliver optimal and accurate responses. The model uses chain-of-thought reasoning for complex tasks and a non-thinking mode for optimized speed.
Conclusion
In this article, we have learned how to build Qwen3-powered agentic AI and RAG systems. Qwen3’s high performance, multilingual support, and advanced reasoning capability make it a strong choice for knowledge retrieval and agent-based tasks. By integrating Qwen3 into RAG and agentic pipelines, we can get accurate, context-aware responses, making it a strong contender for real-world AI-powered applications.
Frequently Asked Questions
Q. How does Qwen3’s hybrid reasoning benefit RAG systems?
A. Qwen3’s hybrid reasoning capability lets it adapt its responses dynamically, optimizing RAG workflows for both fast retrieval and complex analysis.
Q. What components do I need to build a RAG system with Qwen3?
A. It mainly includes a vector database, an embedding model, a Langchain workflow, and an API to access the model.
Q. Can Qwen3 agents perform multi-step tool operations?
A. Yes, with Qwen-Agent’s built-in tool-calling templates, we can parse and chain sequential tool operations like web searching, data analysis, and report generation.
Q. How can I reduce latency when running Qwen3?
A. Latency can be reduced in several ways, for example:
1. Using MoE models like Qwen3-30B-A3B, which only have 3 billion activated parameters.
2. Using GPU-optimized inference.
Q. What are common errors when integrating Qwen3 with MCP?
A. Common errors include:
1. MCP server initialization failures, e.g. JSON formatting and INIT issues.
2. Tool response pairing errors.
3. Context window overflow.