Alibaba, a prominent name in the global technology landscape, has been steadily making waves in the realm of artificial intelligence, particularly with its Qwen series of large language models. Known for their robust capabilities and increasing accessibility through open-source initiatives, the Qwen models have garnered significant attention from the AI research community and industry observers alike. Recently, Alibaba announced a new family of “thinking models,” signaling a significant step forward in their pursuit of artificial general intelligence. This development promises to push the boundaries of what AI can achieve, moving beyond mere language generation towards more complex cognitive abilities.
This new chapter in the Qwen story is marked by the unveiling of two key models: QwQ-Max-Preview and QwQ-32B. QwQ-Max-Preview, announced on February 25, 2025, is built upon the foundation of Qwen 2.5 Max and is specifically designed to excel in mathematics and coding tasks. Currently in its preview stage, it is described as an advancement aimed at “pushing the boundaries of deep reasoning and versatile problem-solving.” The subsequent announcement of QwQ-32B, on March 6, 2025, further underscores Alibaba’s commitment to this new direction. QwQ-32B is presented as a compact reasoning model: at only 32 billion parameters, it delivers performance that rivals much larger, cutting-edge models. The strategic release of both a preview version and a more readily available, open-source model suggests a comprehensive approach to introducing this new “thinking” capability to the world. The preview allows for early engagement and feedback, while the open-source option encourages immediate experimentation and adoption within the AI community.

Unpacking the “Thinking”: What Makes This Model Different?

The very notion of a “thinking model” suggests a departure from traditional language models that primarily focus on generating text. Qwen’s approach to this concept is uniquely articulated in the official blog post for QwQ-32B-Preview. Here, QwQ is portrayed as an entity that approaches problems with “genuine wonder and doubt,” embodying a philosophical spirit akin to an “eternal student of wisdom.” This involves a process of questioning its own assumptions, exploring various lines of thought, and persistently seeking a deeper understanding before arriving at an answer. This description implies a more active and introspective form of intelligence, going beyond simply processing information to actively engaging with it in an analytical manner.
Underpinning this “thinking” process is a robust technological foundation. QwQ-Max-Preview is built upon Qwen 2.5 Max, and QwQ-32B is constructed using Qwen2.5-32B. A key architectural component of Qwen 2.5 Max is its Mixture of Experts (MoE) design. In this architecture, the model comprises smaller, specialized “expert” networks, each focusing on particular aspects of language or knowledge. A “gating network” then acts as an intelligent router, analyzing incoming requests and activating only the most relevant experts for the specific task at hand. This selective activation allows for greater model capacity and efficiency, likely contributing to the model’s ability to handle complex reasoning challenges. Furthermore, Qwen 2.5 Max has been trained on an immense dataset of over 20 trillion tokens. This massive exposure to a wide range of text and code provides a strong foundational understanding of common sense, expert knowledge, and reasoning capabilities necessary for advanced cognitive tasks.
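The routing idea behind MoE can be illustrated with a minimal numerical sketch. This is a toy demonstration of the gating mechanism described above, not Qwen 2.5 Max’s actual implementation; the expert networks, dimensions, and gating weights here are all made up for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_layer(token, experts, gate_weights, top_k=2):
    """Route a token through the top-k experts chosen by the gating network.

    `experts` is a list of callables (the specialized sub-networks) and
    `gate_weights` is the gating network's projection matrix. Only the
    selected experts run, which is what makes MoE computationally efficient.
    """
    scores = softmax(gate_weights @ token)   # gating network: one score per expert
    top = np.argsort(scores)[-top_k:]        # activate only the k best-matching experts
    # Weighted combination of the active experts' outputs
    return sum(scores[i] * experts[i](token) for i in top)

# Toy demo: 4 "experts", each a simple linear map over a 3-dim token
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.standard_normal((3, 3)): W @ x for _ in range(4)]
gate_weights = rng.standard_normal((4, 3))
token = rng.standard_normal(3)
out = moe_layer(token, experts, gate_weights)
print(out.shape)  # (3,)
```

The key design point is that for each input only `top_k` of the experts execute, so total parameter count can grow far beyond the per-token compute cost.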
The training methodology for QwQ-32B also plays a crucial role in its “thinking” abilities. It utilizes a sophisticated two-stage reinforcement learning (RL) approach. In the first stage, the model is trained on mathematics and coding tasks using verifiable rewards. For math problems, an accuracy verifier checks the correctness of the solutions, while for coding tasks, the generated code is executed through a server with predefined test cases. This ensures that the model receives strong, objective feedback on its performance in domains where correctness can be definitively assessed. Following this initial stage, a second phase of RL is employed using a general reward model. This stage focuses on enhancing the model’s general capabilities, such as instruction following, alignment with human preferences, and overall agent performance, without compromising the strong reasoning skills developed in the first stage. This carefully designed training process, particularly the use of verifiable rewards, appears to be a key factor in eliciting the desired “thinking” behavior in Qwen’s new models.
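What a “verifiable reward” looks like in practice can be sketched in a few lines. The two functions below are illustrative stand-ins for the accuracy verifier and the code-execution server described above; the exact reward scales and matching rules Qwen uses are not public, so these are assumptions.

```python
from math import isclose

def math_reward(model_answer: str, reference: float) -> float:
    """Accuracy verifier for math: reward 1.0 iff the final answer matches."""
    try:
        return 1.0 if isclose(float(model_answer), reference, rel_tol=1e-6) else 0.0
    except ValueError:
        return 0.0  # unparseable answer earns no reward

def code_reward(generated_fn, test_cases) -> float:
    """Execution-based verifier for code: fraction of predefined tests passed."""
    passed = 0
    for args, expected in test_cases:
        try:
            if generated_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # crashing code earns no reward for this case
    return passed / len(test_cases)

# Toy demo: score a "generated" solution against predefined test cases
candidate = lambda a, b: a + b
print(code_reward(candidate, [((1, 2), 3), ((5, 5), 10)]))  # 1.0
print(math_reward("42", 42.0))                              # 1.0
```

The appeal of such rewards is that they are objective: the RL signal comes from execution or exact checking rather than from a learned (and potentially gameable) reward model.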

Key Capabilities and Performance Highlights

The initial announcements and benchmark results indicate that Qwen’s new thinking models possess impressive capabilities, particularly in areas requiring deep reasoning. QwQ-Max-Preview is explicitly designed to “push the boundaries of deep reasoning and versatile problem-solving,” with a specific focus on excelling in mathematics and coding-based tasks. Demonstrating its prowess in coding, QwQ-Max-Preview achieved a score of 65.6 on the LiveCodeBench leaderboard. This score surpasses that of OpenAI’s o1 medium (63.4) and o3 Mini Low (60.9), suggesting a competitive edge in code generation and understanding against established models.
Similarly, QwQ-32B has shown remarkable performance, often comparable to models significantly larger in scale, such as DeepSeek R1. In fact, QwQ-32B achieved a score of 79.5 on LiveCodeBench, closely matching DeepSeek R1’s 79.8. Furthermore, QwQ-32B outperformed DeepSeek on the LiveBench benchmark and scored six points higher on BFCL. These results highlight the efficiency of Qwen’s training methodologies, enabling a model with 32 billion parameters to rival the performance of DeepSeek R1, which boasts 671 billion parameters. This achievement suggests that Qwen has made significant strides in optimizing model size without sacrificing crucial performance metrics. The preview version, QwQ-32B-Preview, has also demonstrated strong analytical and problem-solving capabilities in complex technical domains, achieving impressive scores of 65.2% on GPQA (a benchmark for graduate-level scientific reasoning), 50.0% on AIME (a challenging mathematical problem-solving competition), and 90.6% on MATH-500 (which assesses exceptional mathematical reasoning).
Beyond excelling in traditional benchmark tasks, Qwen’s new models also demonstrate significant capabilities in agent-related workflows and tool usage. QwQ-Max-Preview is noted for its “outstanding performance in Agent-related workflows.” Similarly, the research team behind QwQ-32B has integrated agent capabilities into the model, allowing it to think critically, utilize tools effectively, and adapt its reasoning based on environmental feedback. This emphasis on agent capabilities indicates an ambition to develop AI systems that can not only understand and reason but also interact with the world and perform actions autonomously. QwQ-32B, in particular, is highlighted as being well-suited for agent-based tasks and tool calling due to its inherent reasoning abilities and efficiency.
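The agent loop described here can be sketched at its simplest: the model emits either a structured tool call or a plain-text answer, and a controller dispatches accordingly. This is a generic illustration of tool calling, not Qwen’s actual agent interface; the JSON schema and the `calculator` tool are hypothetical.

```python
import json

# Hypothetical tool registry; real agent frameworks differ in the details.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def run_agent_step(model_output: str) -> str:
    """If the model asked for a tool, run it and return the observation;
    otherwise treat the output as the final answer."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output            # plain text: final answer
    tool = TOOLS[call["tool"]]
    return tool(call["arguments"])     # observation, fed back to the model

# The model requests a tool by emitting structured JSON:
print(run_agent_step('{"tool": "calculator", "arguments": "6 * 7"}'))  # 42
print(run_agent_step("The answer is 42."))
```

In a full agent, the tool’s observation would be appended to the conversation and the model queried again, letting it “adapt its reasoning based on environmental feedback,” as the announcement puts it.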
While the primary focus of the “thinking model” announcement is on reasoning abilities, it is worth noting the advancements in other modalities within the broader Qwen family. Qwen2.5-VL, for instance, features improved image and video understanding capabilities, along with agent functionality. This model can analyze images, text, and charts, and even comprehend videos exceeding one hour in length. Notably, Qwen2.5-VL has outperformed GPT-4o in understanding documents, diagrams, and videos. While these multimodal capabilities are not the central theme of the new “thinking model,” they suggest a comprehensive approach to AI development within Alibaba, potentially leading to future models that seamlessly integrate reasoning with multimodal understanding.

Democratizing AI: Open Source and Accessibility

A significant aspect of Alibaba’s strategy with the Qwen series is its commitment to open-source principles. The company has announced the open-sourcing of QwQ-Max and smaller variants, including QwQ-32B, under the permissive Apache 2.0 license. This decision to make these advanced models openly available offers numerous benefits to developers, researchers, and the broader AI community. Open-sourcing fosters collaboration and transparency, and accelerates the pace of innovation, as the community can freely experiment with, fine-tune, and extend these models for a wide range of specialized applications. This collaborative environment can lead to the rapid identification of limitations and the development of improvements that might not be possible within a closed, proprietary system.
The availability of smaller models like QwQ-32B is particularly noteworthy for their potential in local device deployment and privacy-sensitive applications. These models retain robust reasoning capabilities while minimizing computational demands, making them suitable for integration into devices with limited resources. This is especially valuable for applications where low latency is critical or where data privacy concerns necessitate processing data locally rather than in the cloud. QwQ-32B, for example, can be efficiently deployed on consumer-grade hardware due to its significantly reduced deployment costs. Its speed, reaching 450 tokens per second on Groq, further enhances its practicality for real-time applications. To further facilitate access, Qwen models are available on popular AI platforms such as Hugging Face and ModelScope. This ease of access lowers the barrier to entry for developers and researchers to begin utilizing and experimenting with Qwen’s cutting-edge technology.
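For a sense of what local use involves: Qwen-family chat models consume prompts in a ChatML-style template, which the Hugging Face tokenizer normally builds via `tokenizer.apply_chat_template`. The sketch below constructs that template by hand for illustration; the exact special tokens are an assumption based on Qwen’s published chat format, and the example question is invented.

```python
def build_qwen_prompt(messages):
    """Format chat messages into the ChatML-style template used by
    Qwen-family models (normally done by tokenizer.apply_chat_template)."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # cue the model to start responding
    return "".join(parts)

prompt = build_qwen_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Prove that the square root of 2 is irrational."},
])
print(prompt)
```

With the weights pulled from Hugging Face (the QwQ-32B repository under the Qwen organization), this prompt would then be tokenized and passed to the model’s `generate` call like any other causal LM.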

Introducing the Qwen Chat App: AI for Everyone?

In a move to broaden the accessibility of its advanced AI capabilities, Alibaba has announced the launch of a dedicated Qwen Chat app. This application is designed to provide a user-friendly interface that will allow individuals without deep technical expertise to interact with the power of the Qwen “thinking model.” The app will enable seamless interaction for tasks such as problem-solving, code generation, and logical reasoning. Prioritizing real-time responsiveness and integration with popular productivity tools, the Qwen Chat app aims to make advanced AI accessible to a global audience. By bridging the gap between powerful AI and everyday users, this initiative has the potential to democratize access to sophisticated reasoning tools, empowering individuals to leverage AI for a variety of tasks in their personal and professional lives. If the app delivers on its promise of intuitive usability and real-time performance, it could indeed mark a significant step towards making advanced AI a tool for everyone.

Performance Benchmarks in Detail

To provide a clearer picture of the performance of Qwen’s new models, the following table summarizes key benchmark results against competitors, based on the available research material:

Model | LiveCodeBench | Arena-Hard | LiveBench | GPQA-Diamond | AIME | MATH-500 | Parameters
--- | --- | --- | --- | --- | --- | --- | ---
QwQ-Max-Preview | 65.6 | N/A | N/A | N/A | N/A | N/A | N/A
QwQ-32B | High (79.5) | N/A | High | N/A | High | High | 32B
Qwen 2.5-Max | 38.7 | 89.4 | 62.2 | 60.1 | N/A | N/A | MoE
DeepSeek-V3 | 37.6 | 85.5 | 60.5 | 59.1 | N/A | N/A | N/A
DeepSeek R1 | 79.8 | N/A | High | 71.0 | High | N/A | 671B
OpenAI o1 medium | 63.4 | N/A | N/A | N/A | N/A | N/A | N/A
OpenAI o3 Mini Low | 60.9 | N/A | N/A | N/A | N/A | N/A | N/A
QwQ-32B-Preview | N/A | N/A | N/A | 65.2% | 50.0% | 90.6% | 32B

Note: “High” indicates comparable or superior performance based on qualitative descriptions in the source material. The parenthetical LiveCodeBench value for QwQ-32B is taken from the reported benchmark results. The “Parameters” column is added for context.
This table provides a consolidated view of how Qwen’s models perform on various benchmarks that assess different aspects of language understanding, reasoning, mathematics, and coding capabilities. The data suggests that Qwen is competitive with, and in some cases outperforms, models from other leading AI developers in specific areas.

Potential Limitations and Future Directions

Despite the promising capabilities of Qwen’s new thinking models, the research material also points to certain limitations. QwQ-32B-Preview, being an experimental research model, may exhibit language mixing and code-switching, unexpectedly alternating between languages, which could affect the clarity of its responses. It can also enter recursive reasoning loops, producing lengthy responses that never reach a conclusive answer. Furthermore, being a preview release, it requires enhanced safety measures to ensure reliable and secure performance, and users are advised to exercise caution during deployment. While it demonstrates strong performance in math and coding, there is room for improvement in other areas such as common sense reasoning and nuanced language understanding.
Another potential limitation is the context window size of QwQ-32B, reported to be 128k tokens. While substantial, this might be smaller than that of some other leading models, potentially limiting its ability to handle very long and complex reasoning tasks or maintain coherence over extended outputs. Additionally, it is important to consider the potential for biases in Qwen models. As mentioned in the context of Qwen2.5-VL, regulatory compliance in China might lead to certain sensitivities, as observed when the chatbot displayed an error message for prompts about political leaders. Users should be mindful of such potential biases, especially when deploying these models in diverse contexts.
Looking ahead, Alibaba appears to be heavily invested in the future of AI and the continued development of the Qwen series. The company has announced plans to invest over $52 billion in the cloud computing and artificial intelligence sector over the next three years. This significant financial commitment underscores their long-term vision and ambition to be at the forefront of AI innovation. Such substantial investment suggests that we can expect further advancements, improvements, and new capabilities in the Qwen family of models, including the “thinking” models, in the years to come.

Conclusion

Alibaba’s unveiling of its new “thinking” AI models, QwQ-Max-Preview and QwQ-32B, marks a significant step in the evolution of the Qwen series and the broader landscape of large language models. These models demonstrate impressive capabilities, particularly in deep reasoning, mathematics, and coding, often outperforming established models from competitors like OpenAI and DeepSeek on key benchmarks. The emphasis on agent-related tasks further highlights their potential for practical applications in automation and intelligent systems.
The commitment to open-source principles, especially with the release of QwQ-32B, democratizes access to these advanced AI technologies, fostering collaboration and innovation within the AI community. The development of the Qwen Chat app signals an effort to bring the power of these “thinking” models to a wider, non-technical audience, potentially transforming how individuals interact with AI in their daily lives.
While certain limitations, such as language mixing in the preview model and potential biases, need to be considered, the overall trajectory is promising. Alibaba’s substantial investment in AI research and development indicates a continued dedication to pushing the boundaries of what AI can achieve. The Qwen “thinking” models represent a significant advancement towards more sophisticated and versatile artificial intelligence, and their future development will be closely watched by the global tech community.