We are excited to introduce Cotton-1.5V, our first-generation multimodal model. In addition to its powerful text capabilities, Cotton can now process a wide range of visual information, such as documents, diagrams, charts, screenshots, and photographs. Cotton-1.5V will be available soon to our early testers and existing Cotton users.
Capabilities
Cotton-1.5V is highly competitive with existing frontier multimodal models across various domains, from multi-disciplinary reasoning to understanding documents, science diagrams, charts, screenshots, and photographs. We are especially excited about Cotton's capabilities in understanding the physical world. Cotton outperforms its peers in our new RealWorldQA benchmark, which assesses real-world spatial understanding. For all the datasets listed below, we evaluate Cotton in a zero-shot setting without the use of chain-of-thought prompting.
Benchmark | Cotton-1.5V | GPT-4V | Claude 3 Sonnet | Claude 3 Opus | Gemini Pro 1.5 |
---|---|---|---|---|---|
MMMU (multi-discipline) | 53.6% | 56.8% | 53.1% | 59.4% | 58.5% |
MathVista (math) | 52.8% | 49.9% | 47.9% | 50.5% | 52.1% |
AI2D (diagrams)* | 88.3% | 78.2% | 88.7% | 88.1% | 80.3% |
TextVQA (text reading) | 78.1% | 78.0% | - | - | 73.5% |
ChartQA (charts) | 76.1% | 78.5% | 81.1% | 80.8% | 81.3% |
DocVQA (documents) | 85.6% | 88.4% | 89.5% | 89.3% | 86.5% |
RealWorldQA (real-world understanding) | 68.7% | 61.4% | 51.9% | 49.8% | 67.5% |

*For AI2D samples missing annotations such as A, B, C, etc., we render bounding boxes and corresponding letters at the relevant areas in the image.
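As a concrete illustration of the zero-shot, no chain-of-thought setting described above, here is one way a multiple-choice visual benchmark could be scored. This is a minimal sketch: the `query_model` helper and the JSONL fields (`image`, `question`, `answer`) are hypothetical placeholders, not our actual evaluation harness.

```python
# Hypothetical sketch of zero-shot multiple-choice scoring (no chain-of-thought).
# `query_model` and the dataset fields below are assumptions for illustration.
import json
import re

def query_model(image_path: str, prompt: str) -> str:
    """Send one image plus a text prompt to the model and return its reply."""
    raise NotImplementedError("replace with your model client")

def score(jsonl_path: str) -> float:
    correct = total = 0
    with open(jsonl_path) as f:
        for line in f:
            ex = json.loads(line)  # assumed fields: "image", "question", "answer"
            # Zero-shot: the question is sent as-is, with no exemplars and no
            # "think step by step" instruction; we ask for the option letter only.
            prompt = ex["question"] + "\nAnswer with the letter of the correct option only."
            reply = query_model(ex["image"], prompt)
            match = re.search(r"[A-D]", reply.strip().upper())
            predicted = match.group(0) if match else ""
            correct += int(predicted == ex["answer"].strip().upper())
            total += 1
    return correct / max(total, 1)
```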
Writing code from a diagram

User
[Image: a flowchart describing a number-guessing game]
Cotton
Certainly! The flowchart you’ve provided describes a simple guessing game where the computer generates a random number, and the user has to guess it. Here’s the Python code that represents the logic in the flowchart:
```python
import random

def guess_number():
    # Generate a random number between 1 and 10
    target = random.randint(1, 10)

    while True:
        guess = int(input("Guess a number between 1 and 10: "))

        if guess == target:
            print("You won!")
            break
        else:
            print("Wrong guess, try again!")

if __name__ == "__main__":
    guess_number()
```
This code will run the guessing game as described in the flowchart. The user will be prompted to enter their guess until they guess the correct number.
Real-World Understanding
To develop practical real-world AI assistants, advancing a model's understanding of the physical world is essential. To support this goal, we are introducing a new benchmark called RealWorldQA. This benchmark is specifically designed to evaluate the basic spatial understanding capabilities of multimodal models in real-world contexts. While many of the examples in this benchmark are relatively straightforward for humans, they often present challenges for frontier models.

Which object is larger, the pizza cutter or the scissors? A. The pizza cutter is larger. B. The scissors are larger. C. They are about the same size.

Where can we go from the current lane? A. Turn left. B. Go straight. C. Turn left and go straight. D. Turn right.

Given this front camera view from our sedan, do we have enough space to drive around the gray car in front of us? A. Yes. B. No.

Given the picture, in which cardinal direction is the dinosaur facing? A. North. B. South. C. East. D. West.
The initial release of RealWorldQA includes over 700 images, each paired with a question and an easily verifiable answer. The dataset features anonymized images captured from vehicles, along with other real-world images. We are excited to share RealWorldQA with the community and plan to expand it as our multimodal models continue to improve. RealWorldQA is released under CC BY-ND 4.0. Click here (677MB) to download the dataset.
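Because each example pairs an image with a question and a short, verifiable answer, inspecting a local copy of the dataset is straightforward. The snippet below is a minimal sketch that assumes the downloaded archive unpacks to image files plus a metadata file with hypothetical `image`, `question`, and `answer` fields; consult the released archive for the actual layout.

```python
# Minimal sketch for inspecting a local copy of RealWorldQA after download.
# The file layout and field names ("image", "question", "answer") are
# assumptions for illustration, not the documented format.
import json
from pathlib import Path

def inspect_dataset(root: str, metadata_file: str = "metadata.jsonl") -> None:
    root_path = Path(root)
    lines = (root_path / metadata_file).read_text().splitlines()
    examples = [json.loads(line) for line in lines if line.strip()]
    print(f"{len(examples)} examples")  # the initial release has over 700

    # Each record pairs an image with a question and a short, verifiable answer.
    first = examples[0]
    print("image:   ", root_path / first["image"])
    print("question:", first["question"])
    print("answer:  ", first["answer"])

if __name__ == "__main__":
    inspect_dataset("realworldqa")
```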
Into the Future
Advancing both our multimodal understanding and generation capabilities is a key step in building a beneficial AGI that can comprehend the universe. In the coming months, we expect to make significant strides in both areas, enhancing capabilities across various modalities, including images, audio, and video.
If you want to be a part of this journey, we are hiring.