You can now train AI models without using real-world data

By Dwaipayan Roy

Jan 26, 2026

05:04 pm

What's the story

Researchers from Tsinghua University and Microsoft have created a synthetic data pipeline for training artificial intelligence (AI) models. The system, dubbed SynthSmith, uses processors from leading US chip designer NVIDIA. The development marks a significant step toward overcoming the scarcity of real-world data available for improving AI models.

Performance

SynthSmith-trained model outperforms larger models with less data

The SynthSmith pipeline was used to train an X-Coder model with seven billion parameters. This model outperformed models with 14 billion parameters on major coding benchmarks, despite being trained on less data, none of it from the real world. The finding highlights the potential of synthetic data to improve AI performance even when real-world data is scarce.

Solution

Synthetic data: A solution to real-world data scarcity

Synthetic data is generated by AI algorithms to mimic real-world data. As new real-world data becomes scarce, AI researchers are turning to synthetic data as a viable alternative for improving their models. The success of SynthSmith suggests this approach can help overcome one of the key challenges in AI development today.