You can now train AI models without using real-world data
What's the story
Researchers from Tsinghua University and Microsoft have built a synthetic data pipeline for training artificial intelligence (AI) models. The system, dubbed SynthSmith, uses processors from US chip designer NVIDIA. The development is a step toward easing the shortage of real-world data available for improving AI models.
Performance
SynthSmith-trained model outperforms larger models with less data
The SynthSmith pipeline was used to train an X-Coder model with seven billion parameters. The model outperformed rivals with 14 billion parameters on major coding benchmarks, despite being trained on less data, none of it from the real world. The result suggests that synthetic data can improve AI performance even when real-world data is scarce.
Solution
Synthetic data: A solution to real-world data scarcity
Synthetic data is generated by AI algorithms to mimic real-world data. As new real-world data becomes scarce, AI researchers are turning to it as an alternative for improving their models. SynthSmith's results suggest the approach could help address one of the key challenges in AI development today.