Ali Ghodsi, CEO and co-founder of Databricks, took to LinkedIn to introduce Dolly 2.0, billed as the world’s first open-source, instruction-following LLM fine-tuned on a human-generated instruction dataset licensed for commercial use.
In a blog post, Databricks shared more detail about Dolly 2.0. According to the post, Dolly 2.0 can follow instructions, enabling organizations to build, own, and customize LLMs for their specific needs. This means that a company that wants to use an LLM for sentiment analysis of customer reviews doesn’t have to start from scratch. With Dolly, it could start with a pre-trained LLM and fine-tune it on a dataset of customer reviews.
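As a rough illustration of the sentiment-analysis scenario above, the sketch below turns labeled customer reviews into instruction-style records that a fine-tuning job could consume. The field names (`instruction`, `context`, `response`) follow the layout used by instruction datasets such as databricks-dolly-15k. The sample `reviews` data, the `INSTRUCTION` text, and the `to_instruction_record` helper are hypothetical and are not part of any Databricks API.

```python
import json

# Hypothetical labeled reviews; in practice these would come from a company's own data.
reviews = [
    {"text": "The battery lasts all day and charging is fast.", "label": "positive"},
    {"text": "Stopped working after a week.", "label": "negative"},
]

# Assumed task wording; any clear, consistent instruction would do.
INSTRUCTION = "Classify the sentiment of the customer review as positive or negative."


def to_instruction_record(review):
    """Convert one labeled review into an instruction/context/response record."""
    return {
        "instruction": INSTRUCTION,
        "context": review["text"],
        "response": review["label"],
    }


records = [to_instruction_record(r) for r in reviews]

# Serialize as JSON Lines (one record per line), a common format for fine-tuning data.
jsonl = "\n".join(json.dumps(rec) for rec in records)
print(jsonl)
```

Records in this shape could then be passed to whichever fine-tuning pipeline the organization uses; the actual training step is omitted here because it depends on the chosen tooling and hardware.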
Dolly 2.0 is a 12-billion-parameter model based on the EleutherAI Pythia model family and has been fine-tuned exclusively on a new, high-quality, human-generated instruction-following dataset called databricks-dolly-15k. This is the first open-source, human-generated instruction dataset specifically designed to make LLMs exhibit the human-like interactivity of ChatGPT. Databricks made the dataset, the training code, and the model weights available to anyone for commercial use, with the dataset released under the Creative Commons Attribution-ShareAlike 3.0 Unported License.
Databricks received several requests to use its LLMs commercially after releasing Dolly 1.0, which was trained on a dataset that the Stanford Alpaca team created with the OpenAI API. However, that dataset contained output from ChatGPT, and OpenAI’s terms of service prevent anyone from using such output to create a model that competes with OpenAI. Dolly 1.0 was therefore limited to non-commercial use. To overcome this limitation, Databricks created its own dataset, crowdsourcing it among its employees during March and April 2023.
To create a high-quality dataset, Databricks set up a contest and offered a big award to the top 20 labelers. Databricks employees completed seven specific tasks: open Q&A, closed Q&A, extracting information from Wikipedia, summarizing information from Wikipedia, brainstorming, classification, and creative writing. Together, these tasks helped Databricks build an original, high-quality dataset free of the machine-generated output that had restricted Dolly 1.0.
The databricks-dolly-15k dataset contains 15,000 human-generated prompt/response pairs designed specifically for instruction following, covering tasks that range from brainstorming and content generation to information extraction and summarization. By open-sourcing Dolly 2.0, Databricks aims to democratize access to LLMs, enabling organizations to build customized models without paying for API access or sharing data with third parties.
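To make the idea of a prompt/response pair more concrete, here is a minimal sketch of how one record might be rendered into a single training string. The record uses the field names published with the dataset (`instruction`, `context`, `response`, `category`), but its contents are invented for illustration. The `### Instruction:` style markers and the `render_example` helper are assumptions, not necessarily the template Databricks used.

```python
# Illustrative record; the values are made up, only the field names follow the dataset.
record = {
    "instruction": "Summarize the passage in one sentence.",
    "context": "Dolly 2.0 is a 12-billion parameter model fine-tuned on databricks-dolly-15k.",
    "response": "Dolly 2.0 is a 12B-parameter model tuned on a human-written instruction dataset.",
    "category": "summarization",
}


def render_example(rec):
    """Join instruction, optional context, and response into one training string.

    The section markers are an assumed template chosen for readability.
    """
    parts = ["### Instruction:\n" + rec["instruction"]]
    # Some tasks (e.g. brainstorming) have no reference passage, so context is optional.
    if rec.get("context"):
        parts.append("### Input:\n" + rec["context"])
    parts.append("### Response:\n" + rec["response"])
    return "\n\n".join(parts)


text = render_example(record)
print(text)
```

Keeping the context section optional lets the same helper handle both context-dependent tasks, such as summarization and information extraction, and open-ended ones, such as brainstorming.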