Large Models in Action: Transforming Industries

The race in generative AI has intensified, entering what can only be described as its most heated phase. This year, OpenAI has unveiled several noteworthy models, including Sora, GPT-4o, and the o1 series, captivating the tech community across the globe. Other companies are not far behind: Runway has made significant strides with its latest image model, Frames; Midjourney is gearing up to release version 7 of its renowned model; and Claude 3.5 is set for an upgrade. On the hardware front, NVIDIA has announced its latest AI audio model, Fugatto.

The advances are not confined to international giants; Asian tech firms are making waves as well. ByteDance, Baidu, and Tencent have all reported significant progress on their large models, with a focus on using them to enhance cloud services and create added value.

As the momentum builds, the landscape for start-ups dedicated to large models is also shifting rapidly. A vivid illustration is StepFun, a company focused on developing models on the path to Artificial General Intelligence (AGI). On November 27, 2024, StepFun quietly began internal testing of Step-Video, a video generation model; users can apply for access through the company's "Leap Inquiry" website. Development of the second version of the model is already underway.

This low-profile but ambitious start-up has launched at least six foundational models within just eight months, establishing a strong presence on the international stage. Within a single week, its multimodal understanding model, Step-1V, and its trillion-parameter language model, Step-2, placed near the top of global evaluations. Both made headlines in authoritative assessments such as the LMSYS Chatbot Arena and LiveBench, where they led all Chinese models.

In these evaluations, Step-1V matched the performance of Gemini-1.5-Flash-8B-Exp-0827 in the LMSYS Chatbot Arena. Step-2's results came close to those of OpenAI's o1-mini-2024-09-12 and surpassed other mainstream international models such as gpt-4o-2024-08-06. Notably, Step-2 is the only large language model from China to appear in the top ten of these rankings.

As we approach December 1, 2024, the tech world will mark the second anniversary of ChatGPT, the AI chatbot that sparked fresh global enthusiasm for the development of AI models. Reports put the total number of AI models worldwide at 1,328, with China accounting for 36%, which places the country among the industry's front-runners.

Competition in the AI model market is becoming increasingly fierce. Start-ups have often taken the lead, especially StepFun, which was founded only in April 2023 and has built an edge in comprehensive technical capabilities in a span of about 600 days.

The company has introduced the Step series, a broad family of models that covers both understanding and generation across language and multimodal tasks. The Step-1 model has 100 billion parameters and outperforms GPT-3.5 in several areas, including logical reasoning and knowledge in both Chinese and English.

Step-1V, a multimodal model, has achieved performance equivalent to GPT-4V by accurately interpreting and describing many kinds of information in images, which opens the way to tasks such as content creation, logical reasoning, and data analysis. Step-2, with a trillion parameters, is described as the first trillion-parameter model from a start-up to use a mixture-of-experts (MoE) architecture, and it is aimed at exploring the upper limits of model intelligence.

Step-2 excels at creative language generation while keeping tight control over details, which improves its understanding of and adherence to human instructions. Step-1.5V iterates on Step-1V and strengthens its multimodal understanding, which now extends to interpreting and generating video content. Lastly, the Step-Video generation model turns text into video and can produce 10-second 1080p clips efficiently, marking another significant milestone.

Given these advances, StepFun is clearly a formidable player among the "six small tigers" of large models, with a reputation for strong multimodal technology. Its founder and CEO, Jiang Daxin, outlines an ambitious path toward AGI: start with single-modality systems, move to multimodal ones, then unify understanding and generation in a single model, and finally build a world model that leads to AGI.

Jiang emphasizes that integrating multimodal understanding and generation is essential to building a true world model. That integration paves the way for embodied intelligence, which he expects will ultimately lead to AGI and, with it, greater societal capability and economic value.

Research firm IDC predicts that by 2028, global spending on AI technologies may reach $632 billion, more than doubling current expenditure and implying a compound annual growth rate (CAGR) of 29% over the next five years. The explosive growth of generative AI is projected to be a significant contributor to this boom: it is estimated to attract $202 billion in investment, or a 32% share of total AI spending.
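The figures quoted from IDC can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the choice of five compounding periods is an assumption taken from the phrase "over the next five years" rather than from IDC's own methodology.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Total growth factor implied by compounding `cagr` for `years` periods."""
    return (1.0 + cagr) ** years


def implied_base(target: float, cagr: float, years: int) -> float:
    """Starting value that would reach `target` after compounding at `cagr`."""
    return target / growth_multiple(cagr, years)


# Figures quoted in the article (USD billions).
TOTAL_AI_2028 = 632.0
GENAI_2028 = 202.0
CAGR = 0.29
YEARS = 5  # assumption: five compounding periods

genai_share = GENAI_2028 / TOTAL_AI_2028
multiple = growth_multiple(CAGR, YEARS)
base = implied_base(TOTAL_AI_2028, CAGR, YEARS)

print(f"Generative AI share of 2028 spending: {genai_share:.1%}")
print(f"Growth multiple at {CAGR:.0%} over {YEARS} years: {multiple:.2f}x")
print(f"Implied starting spend: ${base:.0f} billion")
```

Under these assumptions the generative AI share comes out at roughly 32%, matching the quoted figure, and a 29% CAGR compounds to about 3.6 times the starting level over five periods, which corresponds to an implied base of roughly $177 billion. A shorter forecast window would imply a larger base.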

However, the generative AI industry is still in its infancy. Academics such as Gao Wen, an academician of the Chinese Academy of Engineering and a professor at Peking University, liken AGI to a toddler just learning to walk. Yet from a usability standpoint, AI is already capable of addressing important problems in production, society, and services. There is no need to wait for a flawless model; incremental development, enhancement, and iteration are the logical way forward.

A growing number of developers and enterprises are using StepFun's model family to build AI applications. Its open platform is gradually evolving into an "ecological partnership circle" for large models, enabling collaboration with leading institutions across finance, media, and entertainment. A notable example is the partnership between Financial Alliance and StepFun, which produced China's first trillion-parameter multimodal financial model, named “Cai Yue F1.”

Developers are also eager to explore new product forms with the Step series. Independent developer Zhao Chun has integrated StepFun's models into three products, including the trending AI application “Wei Zhi Shu,” after rigorous A/B testing showed that StepFun's model produced the highest user engagement. Similarly, the AI-based mental health support application “Lin Jian Liao Yu Shi” has introduced its long-awaited image recognition feature, built on StepFun's multimodal capabilities. The feature has improved user interaction and notably increased the app's revenue.

Looking ahead, companies like StepFun that steadily pursue foundational research toward AGI while accelerating the real-world application of their models are expected to be pivotal drivers of the AGI era and to take the lead in its technology landscape.

In this brave new world of intelligent leaps, the possibilities for each individual may grow tenfold.
