Like the Hitchhiker's Guide to the Galaxy, with its mix of humor, profound insights, and seemingly trivial knowledge, Cotton aims to answer a wide range of questions—while also pushing the boundaries of curiosity by suggesting the kinds of questions you might want to ask. Whether it's tackling complex issues, providing suggestions, or offering guidance, Cotton's goal is to engage with you across diverse topics, much like the Guide covers everything from the meaning of life to the most trivial facts about the universe.

Cotton is designed to answer questions with a bit of wit and has a rebellious streak, so please don’t use it if you hate humor!

Real-time access to information is one of Cotton's special perks. Combined with its more open-minded approach, Cotton is ready to dive into the spicy and the unconventional without holding back. Ready to test out Cotton's no-holds-barred style?

Cotton is still a very early beta product – the best we could do with 2 months of training – so expect it to improve rapidly with each passing week, with your help.

Thank you,
the D-AI Team

Why we are building Cotton

At D-AI, we want to create AI tools that assist humanity in its quest for understanding and knowledge.

By creating and improving Cotton, we aim to:

  • Build AI tools that benefit everyone, regardless of background or political perspective. We want to empower our users with capable tools, subject to the law, while fostering open discussion and diverse applications. We are exploring and demonstrating this approach in public, and gathering feedback from users along the way.
  • Empower research and innovation. We want Cotton to serve as a powerful research assistant that speeds up discovery, fosters creativity, and unlocks new possibilities.

Our ultimate goal is for our AI tools to assist in the pursuit of understanding.

The journey to Cotton-1

Cotton-1 has evolved through multiple iterations over the last four months, enhancing its capabilities. This continuous improvement allows it to deliver better performance and reliability, providing powerful assistance for tasks like research and data processing.

After announcing D-AI, we trained a prototype LLM, Cotton-0, with 33 billion parameters. This early model demonstrated capabilities approaching those of LLaMA 2 (70B) on standard language model benchmarks, even though it used only half of the training resources. Over the last two months, we've made significant strides in improving reasoning and coding abilities, leading to the development of Cotton-1. Cotton-1 is a state-of-the-art language model that is considerably more powerful than its predecessors, achieving a 63.2% score on the HumanEval coding task and a 73% score on the MMLU benchmark.

To understand the capability improvements we made with Cotton-1, we have conducted a series of evaluations using a few standard machine learning benchmarks designed to measure math and reasoning abilities.

  • GSM8k: Middle school math word problems (Cobbe et al. 2021), using the chain-of-thought prompt.
  • MMLU: Multidisciplinary multiple-choice questions (Hendrycks et al. 2021), provided 5-shot in-context examples.
  • HumanEval: Python code completion task (Chen et al. 2021), zero-shot evaluated for pass@1.
  • MATH: Middle school and high school mathematics problems written in LaTeX (Hendrycks et al. 2021), prompted with a fixed 4-shot prompt.

Benchmark           Cotton-0 (33B)  LLaMA 2 70B  Inflection-1  GPT-3.5  Cotton-1  PaLM 2  Claude 2  GPT-4
GSM8k (8-shot)      56.8%           56.8%        62.9%         57.1%    62.9%     80.7%   88.0%     92.0%
MMLU (5-shot)       65.7%           68.9%        72.7%         70.0%    73.0%     78.0%   75.0%*    86.4%
HumanEval (0-shot)  39.7%           29.9%        35.4%         48.1%    63.2%     -       70%       67%
MATH (4-shot)       15.7%           13.5%        16.0%         23.5%    23.9%     34.6%   -         42.5%

*Claude 2 was evaluated on MMLU with 5-shot in-context examples plus chain-of-thought prompting.

On these benchmarks, Cotton-1 demonstrated strong performance, surpassing all other models in its compute class, including GPT-3.5 and Inflection-1. It is only outperformed by models trained with significantly more data and compute, such as GPT-4. This highlights the rapid progress we are making at D-AI in training large language models with exceptional efficiency.
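For reference, the pass@1 scores reported for HumanEval come from the standard unbiased pass@k estimator introduced by Chen et al. (2021). A minimal sketch of that estimator; the sample counts in the usage line are made up for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. 2021).

    n: completions sampled per problem
    c: completions that pass the unit tests
    k: number of completions we are allowed to submit
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    # 1 - P(all k drawn samples fail)
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to the passing fraction c / n:
print(pass_at_k(n=10, c=4, k=1))  # 0.4
```

Averaging this quantity over all problems in the benchmark gives the reported score.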

Since these benchmarks can be found on the web and we can't rule out that our models were inadvertently trained on them, we hand-graded our model (along with Claude 2 and GPT-4) on the 2023 Hungarian national high school finals in mathematics, which was published at the end of May, after we collected our dataset. Cotton passed the exam with a C (59%), Claude 2 achieved the same grade (55%), and GPT-4 earned a B (68%). All models were evaluated at temperature 0.1 with the same prompt. Note that we made no effort to tune for this evaluation; it served as a "real-life" test on a dataset our model was never explicitly tuned for.

Human-graded evaluation (1-shot)                     Cotton-0  GPT-3.5  Claude 2  Cotton-1  GPT-4
Hungarian National High School Math Exam (May 2023)  37%       41%      55%       59%       68%

We provide a summary of the important technical details of Cotton-1 in the model card.

Engineering at D-AI

At the forefront of deep learning research, reliable infrastructure must be built with the same care as datasets and learning algorithms. To create Cotton, we developed a custom training and inference stack utilizing Kubernetes, Rust, and JAX.

Training large language models (LLMs) is akin to a freight train racing ahead; if one part fails, the entire process can derail, making recovery difficult. There are numerous ways GPUs can fail, including manufacturing defects, loose connections, incorrect configurations, degraded memory chips, and random bit flips. Given that training involves synchronizing computations across tens of thousands of GPUs for extended periods, these issues become common at scale. To address these, we’ve developed custom distributed systems that quickly detect and automatically manage failures. At D-AI, we focus on maximizing compute efficiency, reducing downtime, and maintaining high Model Flop Utilization (MFU) even amidst unreliable hardware.
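Model Flop Utilization compares the FLOPs a training run actually sustains against the aggregate peak of the hardware. A minimal sketch using the common ~6 × params FLOPs-per-trained-token approximation for dense transformers; the parameter count, throughput, GPU count, and peak rating below are illustrative assumptions, not Cotton's actual figures:

```python
def model_flop_utilization(params: float, tokens_per_sec: float,
                           num_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU = achieved training FLOPs/s divided by aggregate peak FLOPs/s.

    Uses the standard ~6 * params FLOPs per trained token
    (forward + backward pass) approximation for dense transformers.
    """
    achieved = 6.0 * params * tokens_per_sec
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak

# Illustrative numbers only: a 33e9-parameter model processing 2e6 tokens/s
# on 1024 GPUs rated at 989e12 dense BF16 FLOP/s each.
mfu = model_flop_utilization(33e9, 2e6, 1024, 989e12)
print(f"{mfu:.1%}")  # 39.1% with these made-up numbers
```

A GPU failure or straggler that stalls the synchronized step shows up directly as lost tokens per second, and hence as a drop in this ratio, which is why we track it closely.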

Rust has proven to be the perfect choice for building scalable, reliable, and maintainable infrastructure. Its high performance, rich ecosystem, and ability to prevent common bugs make it ideal for distributed systems. Given our small team size, ensuring infrastructure reliability is crucial; without it, maintenance can stifle innovation. Rust gives us the confidence that code modifications or refactors will result in reliable programs that can run for months with minimal supervision.

We are preparing for our next leap in model capabilities, which involves reliably coordinating training runs across tens of thousands of accelerators, managing internet-scale data pipelines, and adding new features and tools to Cotton. If that sounds exciting to you, apply to join the team here.

Research at D-AI

We give Cotton access to search tools and real-time information, but as with all the LLMs trained on next-token prediction, our model can still generate false or contradictory information. We believe that achieving reliable reasoning is the most important research direction to address the limitations of current systems. Here, we would like to highlight a few promising research directions we are most excited about at D-AI:

  • Scalable oversight with tool assistance is key to improving the effectiveness of human feedback. While human feedback is essential, it can be challenging to provide consistent and accurate feedback, especially with lengthy code or complex reasoning. AI can assist by looking up references, verifying intermediate steps with external tools, and seeking human feedback when necessary. Our goal is to maximize the efficiency of our AI tutors' time, making the most of our models to enhance the feedback process.
  • Integrating with formal verification for safety, reliability, and grounding. To create AI systems that can reason deeply about the real world, we plan to develop reasoning skills in less ambiguous and more verifiable situations. This allows us to evaluate our systems without human feedback or interaction with the real world. One major immediate goal of this approach is to give formal guarantees for code correctness, especially regarding formally verifiable aspects of AI safety.
  • Long-context understanding and retrieval. Training models to efficiently discover useful knowledge in a particular context is at the heart of producing truly intelligent systems. We are working on methods that can discover and retrieve information whenever it is needed.
  • Adversarial robustness. Adversarial examples demonstrate that optimizers can easily exploit vulnerabilities in AI systems, both during training and serving time, causing them to make egregious mistakes. These vulnerabilities are long-standing weaknesses of deep learning models. We are particularly interested in improving the robustness of LLMs, reward models, and monitoring systems.
  • Multimodal capabilities. Currently, Cotton doesn’t have other senses, such as vision and audio. To better assist users, we will equip Cotton with these different senses that can enable broader applications, including real-time interactions and assistance.

We believe that AI holds immense potential for contributing significant scientific and economic value to society, so we will work towards developing reliable safeguards against catastrophic forms of malicious use. We believe in doing our utmost to ensure that AI remains a force for good.

If you share our optimism and want to contribute to our mission, apply to join the team here.

Early Access to Cotton

We are offering a limited number of users in the United States the chance to try out our Cotton prototype and provide valuable feedback that will help us improve its capabilities before a wider release. You can join the Cotton waitlist here. This release represents just the first step for D-AI. Looking ahead, we have an exciting roadmap and will be rolling out new capabilities and features in the coming months.