Introducing Cotton-1.5, our latest model designed for long context understanding and advanced reasoning. Cotton-1.5 will be available to our early testers and existing Cotton users in the coming days.

Two weeks ago, we released the model weights and network architecture of Cotton-1, offering a glimpse into the progress D-AI had made up until last November. Since then, we have enhanced the reasoning and problem-solving capabilities in our latest model, Cotton-1.5.

Capabilities and Reasoning

One of the most notable improvements in Cotton-1.5 is its performance in coding and math-related tasks. In our tests, Cotton-1.5 achieved a 50.6% score on the MATH benchmark and a 90% score on the GSM8K benchmark, both of which cover a wide range of grade school to high school competition problems. Additionally, Cotton-1.5 scored 74.1% on the HumanEval benchmark, which assesses code generation and problem-solving abilities.

| Benchmark | Cotton-1 | Cotton-1.5 | Mistral Large | Claude 2 | Claude 3 Sonnet | Gemini Pro 1.5 | GPT-4 | Claude 3 Opus |
|-----------|----------|------------|---------------|----------|-----------------|----------------|-------|---------------|
| MMLU | 73% (5-shot) | 81.3% (5-shot) | 81.2% (5-shot) | 75% (5-shot) | 79% (5-shot) | 83.7% (5-shot) | 86.4% (5-shot) | 86.8% (5-shot) |
| MATH | 23.9% (4-shot) | 50.6% (4-shot) | – | – | 40.5% (4-shot) | 58.5% (4-shot) | 52.9% (4-shot) | 61% (4-shot) |
| GSM8K | 62.9% (8-shot) | 90% (8-shot) | 81% (5-shot) | 88% (0-shot CoT) | 92.3% (0-shot CoT) | 91.7% (11-shot) | 92% (5-shot) | 95% (0-shot CoT) |
| HumanEval | 63.2% (0-shot) | 74.1% (0-shot) | 45.1% (0-shot) | 70% (0-shot) | 73% (0-shot) | 71.9% (0-shot) | 67% (0-shot) | 84.9% (0-shot) |

Long Context Understanding

A new feature in Cotton-1.5 is its ability to process contexts of up to 128K tokens. This 16-fold increase over the previous context length significantly expands Cotton's memory capacity, allowing it to draw on information from much longer documents.

The accompanying graph visualizes the model's ability to recall information from its context window: the x-axis is the length of the context, the y-axis is the relative position within the window of the fact to retrieve, and color marks the recall rate. The entire graph is green, indicating a 100% recall rate for every context length and every placement of the fact.

Furthermore, Cotton-1.5 can handle longer and more complex prompts while maintaining its ability to follow instructions as its context window expands. In the Needle In A Haystack (NIAH) evaluation, Cotton-1.5 showcased its powerful retrieval capabilities, effectively extracting embedded text from contexts as long as 128K tokens and achieving perfect retrieval results.
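The structure of a NIAH-style check can be sketched in a few lines. The helper names below (`build_haystack`, `score_recall`) are illustrative, not part of Cotton's actual evaluation harness:

```python
# Hedged sketch of a Needle In A Haystack (NIAH) style evaluation:
# hide a fact (the "needle") at a chosen depth inside filler text, then
# check whether the model's answers recover it. Helper names are
# illustrative, not Cotton's real harness.

def build_haystack(filler: str, needle: str, n_filler: int, depth: float) -> str:
    """Repeat `filler` n_filler times and insert `needle` at a relative
    depth in [0, 1], where 0 is the start of the context and 1 the end."""
    chunks = [filler] * n_filler
    chunks.insert(round(depth * n_filler), needle)
    return "\n".join(chunks)

def score_recall(answers: list[str], needle_fact: str) -> float:
    """Fraction of model answers that contain the embedded fact."""
    return sum(needle_fact in a for a in answers) / len(answers)
```

Sweeping `depth` from 0 to 1 while growing the haystack toward the full 128K-token window produces the grid of recall rates visualized in the graph above.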

Cotton-1.5 Infra

Cutting-edge Large Language Model (LLM) research that operates on massive GPU clusters requires robust and flexible infrastructure. Cotton-1.5 is built on a custom distributed training framework using JAX, Rust, and Kubernetes. This training stack allows our team to prototype ideas and scale new architectures with minimal effort. A key challenge in training LLMs on large compute clusters is ensuring reliability and uptime. Our custom training orchestrator automatically detects and ejects problematic nodes from the training job. Additionally, we have optimized checkpointing, data loading, and job restarts to minimize downtime in case of a failure. If working on our training stack interests you, apply to join the team.
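The fault-tolerance ideas above (periodic checkpointing, plus ejecting a failed node and restarting from the last checkpoint) can be illustrated with a small Python sketch; the class and function names are hypothetical, not Cotton's actual orchestrator API:

```python
# Hedged sketch of the fault-tolerance loop described above: periodic
# checkpointing, plus ejecting a failed node and resuming from the last
# checkpoint. All names are illustrative, not Cotton's real stack.
import pickle

class Orchestrator:
    """Tracks which nodes are healthy enough to stay in the training job."""
    def __init__(self, nodes):
        self.healthy = set(nodes)

    def eject(self, node):
        self.healthy.discard(node)

def save_checkpoint(path, step, params):
    with open(path, "wb") as f:
        pickle.dump({"step": step, "params": params}, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

def train(orch, steps, ckpt_path, ckpt_every=2):
    step, params = 0, {"w": 0.0}
    save_checkpoint(ckpt_path, step, params)
    while step < steps:
        try:
            # Simulate a hardware fault once, while "bad-node" is in the job.
            if step == 3 and "bad-node" in orch.healthy:
                raise RuntimeError("node failure: bad-node")
            params["w"] += 0.1                  # stand-in for one optimizer step
            step += 1
            if step % ckpt_every == 0:
                save_checkpoint(ckpt_path, step, params)
        except RuntimeError:
            orch.eject("bad-node")              # detect and eject the node
            state = load_checkpoint(ckpt_path)  # resume from last checkpoint
            step, params = state["step"], state["params"]
    return step, params
```

In a real cluster the `RuntimeError` would surface from the communication layer rather than being simulated, but the recovery path (eject, reload, continue) is the same shape.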

Looking Ahead

Cotton-1.5 will soon be available to early testers, and we look forward to receiving your feedback to help us improve the model. As we gradually roll out Cotton-1.5 to a broader audience, we are excited to introduce several new features in the coming days.

Note that the GPT-4 scores are taken from the March 2023 release. For MATH and GSM8K, we present maj@1 results. For HumanEval, we report pass@1 benchmark scores.
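For reference, pass@1 with a single sample per problem reduces to the fraction of problems whose one generated solution passes its unit tests. A minimal sketch (the function name is illustrative):

```python
def pass_at_1(results: list[bool]) -> float:
    """pass@1 when one solution is sampled per problem: the fraction of
    problems whose single sample passes all of its unit tests."""
    return sum(results) / len(results)
```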