Introducing Cotton-1.5, our latest model designed for long context understanding and advanced reasoning. Cotton-1.5 will be available to our early testers and existing Cotton users in the coming days.

Two weeks ago, we released the model weights and network architecture of Cotton-1, offering a glimpse into the progress D-AI had made up until last November. Since then, we have enhanced the reasoning and problem-solving capabilities in our latest model, Cotton-1.5.

Capabilities and Reasoning

One of the most notable improvements in Cotton-1.5 is its performance in coding and math-related tasks. In our tests, Cotton-1.5 achieved a 50.6% score on the MATH benchmark and a 90% score on the GSM8K benchmark, both of which cover a wide range of grade school to high school competition problems. Additionally, Cotton-1.5 scored 74.1% on the HumanEval benchmark, which assesses code generation and problem-solving abilities.

| Benchmark | Cotton-1 | Cotton-1.5 | Mistral Large | Claude 2 | Claude 3 Sonnet | Gemini Pro 1.5 | GPT-4 | Claude 3 Opus |
|-----------|----------|------------|---------------|----------|-----------------|----------------|-------|---------------|
| MMLU | 73% (5-shot) | 81.3% (5-shot) | 81.2% (5-shot) | 75% (5-shot) | 79% (5-shot) | 83.7% (5-shot) | 86.4% (5-shot) | 86.8% (5-shot) |
| MATH | 23.9% (4-shot) | 50.6% (4-shot) | – | – | 40.5% (4-shot) | 58.5% (4-shot) | 52.9% (4-shot) | 61% (4-shot) |
| GSM8K | 62.9% (8-shot) | 90% (8-shot) | 81% (5-shot) | 88% (0-shot CoT) | 92.3% (0-shot CoT) | 91.7% (11-shot) | 92% (5-shot) | 95% (0-shot CoT) |
| HumanEval | 63.2% (0-shot) | 74.1% (0-shot) | 45.1% (0-shot) | 70% (0-shot) | 73% (0-shot) | 71.9% (0-shot) | 67% (0-shot) | 84.9% (0-shot) |

Long Context Understanding

A new feature in Cotton-1.5 is its ability to process contexts of up to 128K tokens. This 16-fold increase over the previous context length significantly expands Cotton's memory capacity, allowing it to draw on information from much longer documents.

The accompanying graph visualizes the model's ability to recall information from its context window: the x-axis is the length of the context, the y-axis is the relative position within the window of the fact to retrieve, and color marks the recall rate. The entire graph is green, indicating a 100% recall rate for every context length and every placement of the fact.

Furthermore, Cotton-1.5 can handle longer and more complex prompts while maintaining its ability to follow instructions as its context window expands. In the Needle In A Haystack (NIAH) evaluation, Cotton-1.5 showcased its powerful retrieval capabilities, effectively extracting embedded text from contexts as long as 128K tokens and achieving perfect retrieval results.
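The structure of a NIAH-style check can be sketched in a few lines. The helper names below (`build_haystack`, `score_recall`) are illustrative, not part of Cotton's actual evaluation harness:

```python
# Hedged sketch of a Needle In A Haystack (NIAH) style evaluation:
# hide a fact (the "needle") at a chosen depth inside filler text, then
# check whether the model's answers recover it. Helper names are
# illustrative, not Cotton's real harness.

def build_haystack(filler: str, needle: str, n_filler: int, depth: float) -> str:
    """Repeat `filler` n_filler times and insert `needle` at a relative
    depth in [0, 1], where 0 is the start of the context and 1 the end."""
    chunks = [filler] * n_filler
    chunks.insert(round(depth * n_filler), needle)
    return "\n".join(chunks)

def score_recall(answers: list[str], needle_fact: str) -> float:
    """Fraction of model answers that contain the embedded fact."""
    return sum(needle_fact in a for a in answers) / len(answers)
```

Sweeping `depth` from 0 to 1 while growing the haystack toward the full 128K-token window produces the grid of recall rates visualized in the graph above.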

Cotton-1.5 Infra

Cutting-edge Large Language Model (LLM) research that operates on massive GPU clusters requires robust and flexible infrastructure. Cotton-1.5 is built on a custom distributed training framework using JAX, Rust, and Kubernetes. This training stack allows our team to prototype ideas and scale new architectures with minimal effort. A key challenge in training LLMs on large compute clusters is ensuring reliability and uptime. Our custom training orchestrator automatically detects and ejects problematic nodes from the training job. Additionally, we have optimized checkpointing, data loading, and job restarts to minimize downtime in case of a failure. If working on our training stack interests you, apply to join the team.
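The fault-tolerance ideas above (periodic checkpointing, plus ejecting a failed node and restarting from the last checkpoint) can be illustrated with a small Python sketch; the class and function names are hypothetical, not Cotton's actual orchestrator API:

```python
# Hedged sketch of the fault-tolerance loop described above: periodic
# checkpointing, plus ejecting a failed node and resuming from the last
# checkpoint. All names are illustrative, not Cotton's real stack.
import pickle

class Orchestrator:
    """Tracks which nodes are healthy enough to stay in the training job."""
    def __init__(self, nodes):
        self.healthy = set(nodes)

    def eject(self, node):
        self.healthy.discard(node)

def save_checkpoint(path, step, params):
    with open(path, "wb") as f:
        pickle.dump({"step": step, "params": params}, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

def train(orch, steps, ckpt_path, ckpt_every=2):
    step, params = 0, {"w": 0.0}
    save_checkpoint(ckpt_path, step, params)
    while step < steps:
        try:
            # Simulate a hardware fault once, while "bad-node" is in the job.
            if step == 3 and "bad-node" in orch.healthy:
                raise RuntimeError("node failure: bad-node")
            params["w"] += 0.1                  # stand-in for one optimizer step
            step += 1
            if step % ckpt_every == 0:
                save_checkpoint(ckpt_path, step, params)
        except RuntimeError:
            orch.eject("bad-node")              # detect and eject the node
            state = load_checkpoint(ckpt_path)  # resume from last checkpoint
            step, params = state["step"], state["params"]
    return step, params
```

In a real cluster the `RuntimeError` would surface from the communication layer rather than being simulated, but the recovery path (eject, reload, continue) is the same shape.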

Looking Ahead

Cotton-1.5 will soon be available to early testers, and we look forward to receiving your feedback to help us improve the model. As we gradually roll out Cotton-1.5 to a broader audience, we are excited to introduce several new features in the coming days.

Note that the GPT-4 scores are taken from the March 2023 release. For MATH and GSM8K, we present maj@1 results. For HumanEval, we report pass@1 benchmark scores.
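For reference, pass@1 with a single sample per problem reduces to the fraction of problems whose one generated solution passes its unit tests. A minimal sketch (the function name is illustrative):

```python
def pass_at_1(results: list[bool]) -> float:
    """pass@1 when one solution is sampled per problem: the fraction of
    problems whose single sample passes all of its unit tests."""
    return sum(results) / len(results)
```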