are increasingly investigating what's known as "synthetic data" to train their large language models for a number of reasons, not least of which being that it's apparently more cost-effective. Beyond the relative cheapness of synthetic data, however, is the scale issue. Training cutting-edge LLMs already consumes essentially all the human-created data that's available, meaning that to build even stronger models, companies are almost certainly going to need more.
"If you could get all the data that you needed off the web, that would be fantastic," Gomez said. "In reality, the web is so noisy and messy that it's not really representative of the data that you want. The web just doesn't do everything we need." As the CEO noted, Cohere and other companies are already quietly using synthetic data to train their LLMs "even if it's not broadcast widely," and others like OpenAI seem to expect to use it in the future.
During an event in May, OpenAI CEO Sam Altman quipped that he is "pretty confident that soon all data will be synthetic data," the report notes, and