Hugging Face’s Cosmopedia Hopes To Reshape Pre-Training Data

Historically, building datasets for supervised fine-tuning and instruction tuning has relied on hiring human annotators, a painstaking process that is both time-consuming and prohibitively expensive.

Hugging Face hopes to change that with Cosmopedia, a synthetic dataset that covers hundreds of subjects with a duplicate content rate of less than 1%. With over 25 billion tokens and 30 million files, Cosmopedia is the largest open synthetic dataset to date.

Creating synthetic data that is both diverse and scalable is a complex undertaking. To address this, the Hugging Face team crafted over 30 million Cosmopedia prompts spanning hundreds of topics, achieving a duplicate content rate of less than 1%. This monumental effort underscores the commitment to providing an extensive, high-quality synthetic data resource.
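To make the duplicate-rate figure more concrete, here is a minimal sketch of one way such a rate could be measured. It counts only exact duplicates after light normalization; the article does not say how the figure was computed, and a production pipeline may well rely on near-duplicate detection instead. The function names `normalize` and `duplicate_rate` are illustrative, not part of any Hugging Face tooling.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def duplicate_rate(documents: list[str]) -> float:
    """Return the fraction of documents that repeat an earlier document.

    Assumption: only exact matches (after normalization) count as duplicates.
    """
    seen: set[str] = set()
    duplicates = 0
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / len(documents) if documents else 0.0
```

Calling `duplicate_rate` on a list of generated texts returns a value between 0 and 1, so a result below 0.01 would correspond to the "less than 1%" threshold mentioned above.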

Cosmopedia’s creation involved a dual approach: conditioning prompts on web data for scale and on curated sources for quality. The curated sources include educational resources such as OpenStax and Khan Academy, which help anchor the generated content in high-quality material.

The web data, which makes up over 80% of Cosmopedia’s prompts, was handled with an approach akin to RefinedWeb, with millions of web samples organized into meaningful clusters. The effort not only enriches AI training resources but also shows why safeguards such as a decontamination pipeline are needed to protect the integrity of synthetic data.

 

This decontamination step, similar to the one used for the Phi-1 model, removes potentially contaminated samples so that the dataset stays clean. Cosmopedia and similar projects offer a glimpse into the future of AI development.
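The sample-removal idea described above can be sketched as a simple n-gram overlap filter. This is an illustrative assumption, not the pipeline Hugging Face actually ran: the article does not give the matching rule, so the whitespace tokenization, the default window of 10 tokens, and the names `ngrams` and `decontaminate` are all hypothetical choices.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a text (empty if too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(samples: list[str], benchmark_texts: list[str], n: int = 10) -> list[str]:
    """Keep only the samples that share no n-gram with any benchmark text.

    Assumption: a single shared n-gram is enough to flag a sample as contaminated.
    """
    benchmark_ngrams: set[tuple[str, ...]] = set()
    for text in benchmark_texts:
        benchmark_ngrams |= ngrams(text, n)
    return [s for s in samples if ngrams(s, n).isdisjoint(benchmark_ngrams)]
```

A smaller `n` flags more samples at the cost of more false positives, while a larger `n` is more conservative; the right setting depends on the benchmarks being protected.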

These advancements promise a more inclusive field, where the creation of comprehensive datasets is not confined to a privileged few but is accessible to a broader spectrum of researchers. As the AI community continues to explore and refine these methods, the potential for accelerated innovation and growth in AI capabilities seems boundless.

For developers, researchers, and enthusiasts alike, the evolution of synthetic datasets marks an important moment in AI’s development. The success of projects like Cosmopedia not only supports the training of more sophisticated models but also paves the way for the next generation of AI advancements.

ODSC Team

ODSC gathers the attendees, presenters, and companies that are shaping the present and future of data science and AI. ODSC hosts one of the largest gatherings of professional data scientists with major conferences in USA, Europe, and Asia.
