We are excited to introduce Cotton-1.5V, our first-generation multimodal model. In addition to its powerful text capabilities, Cotton can now process a wide range of visual information, such as documents, diagrams, charts, screenshots, and photographs. Cotton-1.5V will be available soon to our early testers and existing Cotton users.

Capabilities

Cotton-1.5V is highly competitive with existing frontier multimodal models across various domains, from multi-disciplinary reasoning to understanding documents, science diagrams, charts, screenshots, and photographs. We are especially excited about Cotton's capabilities in understanding the physical world. Cotton outperforms its peers on our new RealWorldQA benchmark, which assesses real-world spatial understanding. For all the datasets listed below, we evaluate Cotton in a zero-shot setting without chain-of-thought prompting.

| Benchmark | Cotton-1.5V | GPT-4V | Claude 3 Sonnet | Claude 3 Opus | Gemini Pro 1.5 |
|---|---|---|---|---|---|
| MMMU (Multi-discipline) | 53.6% | 56.8% | 53.1% | 59.4% | 58.5% |
| MathVista (Math) | 52.8% | 49.9% | 47.9% | 50.5% | 52.1% |
| AI2D (Diagrams)* | 88.3% | 78.2% | 88.7% | 88.1% | 80.3% |
| TextVQA (Text reading) | 78.1% | 78.0% | - | - | 73.5% |
| ChartQA (Charts) | 76.1% | 78.5% | 81.1% | 80.8% | 81.3% |
| DocVQA (Documents) | 85.6% | 88.4% | 89.5% | 89.3% | 86.5% |
| RealWorldQA (Real-world understanding) | 68.7% | 61.4% | 51.9% | 49.8% | 67.5% |

*For AI2D samples missing annotations such as A, B, C, etc., we render bounding boxes and corresponding letters at the relevant areas in the image.
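As an illustration of the zero-shot setting above, here is a minimal sketch of a multiple-choice evaluation loop. The `Sample` layout and the `query_model` callable are hypothetical stand-ins for illustration, not our evaluation harness:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str       # path to the benchmark image
    question: str         # question about the image
    options: list[str]    # answer options, e.g. ["A. ...", "B. ..."]
    answer: str           # gold answer letter, e.g. "B"

def build_prompt(sample: Sample) -> str:
    # Zero-shot: just the question and options, with no worked
    # examples and no chain-of-thought instruction.
    options = "\n".join(sample.options)
    return f"{sample.question}\n{options}\nAnswer with a single letter."

def evaluate(samples: list[Sample], query_model) -> float:
    # query_model(image_path, prompt) -> str is assumed to wrap a
    # call to the multimodal model under test.
    correct = 0
    for sample in samples:
        reply = query_model(sample.image_path, build_prompt(sample))
        predicted = reply.strip()[:1].upper()  # first letter of the reply
        correct += predicted == sample.answer
    return correct / len(samples)
```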

Writing code from a diagram

A diagram describing a guessing game with a flowchart of the game logic and user interactions
User
Can you translate this into Python code?
Cotton

Certainly! The flowchart you’ve provided describes a simple guessing game where the computer generates a random number, and the user has to guess it. Here’s the Python code that represents the logic in the flowchart:

```python
import random

def guess_number():
    # Generate a random number between 1 and 10
    target = random.randint(1, 10)

    while True:
        guess = int(input("Guess a number between 1 and 10: "))

        if guess == target:
            print("You won!")
            break
        else:
            print("Wrong guess, try again!")

if __name__ == "__main__":
    guess_number()
```

This code runs the guessing game as described in the flowchart: the user is prompted for a guess until they enter the correct number.
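One detail worth noting about the generated code: `int(input(...))` raises a `ValueError` on non-numeric input, which would end the loop with a crash. A defensive variant of the guess-reading step, as a small hardening sketch, might look like this:

```python
def read_guess() -> int:
    # Re-prompt until the user enters a valid integer, so a stray
    # keystroke does not crash the game loop.
    while True:
        raw = input("Guess a number between 1 and 10: ")
        try:
            return int(raw)
        except ValueError:
            print("Please enter a whole number.")
```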

Real-World Understanding

Developing practical real-world AI assistants requires advancing a model's understanding of the physical world. To support this goal, we are introducing a new benchmark, RealWorldQA, designed to evaluate the basic spatial understanding capabilities of multimodal models in real-world contexts. While many of the examples in this benchmark are relatively straightforward for humans, they often present challenges for frontier models.

A picture of various household objects on a table.

Which object is larger, the pizza cutter or the scissors? A. The pizza cutter is larger. B. The scissors are larger. C. They are about the same size.

A picture of a dark street scene with a sign that says "left lane must turn left"

Where can we go from the current lane? A. Turn left. B. Go straight. C. Turn left and go straight. D. Turn right.

A picture of a street with several cars parked on the left

Given this front camera view from our sedan, do we have enough space to drive around the gray car in front of us? A. Yes. B. No.

A picture of a toy next to a cell phone showing a compass.

Given the picture, in which cardinal direction is the dinosaur facing? A. North. B. South. C. East. D. West.

The initial release of RealWorldQA includes over 700 images, each paired with a question and an easily verifiable answer. The dataset features anonymized images captured from vehicles, along with other real-world images. We are excited to share RealWorldQA with the community and plan to expand it as our multimodal models continue to improve. RealWorldQA is released under CC BY-ND 4.0. Click here (677MB) to download the dataset.
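For a rough sense of how one might score a model against a dataset of this shape (an image, a question, and a short verifiable answer), here is a minimal loading-and-scoring sketch. The `metadata.json` file and its `image`/`question`/`answer` fields are assumptions for illustration; the actual archive layout may differ:

```python
import json
from pathlib import Path

def load_samples(root: str) -> list[dict]:
    # Hypothetical layout: a metadata.json listing image files,
    # questions, and gold answers alongside the images.
    with open(Path(root) / "metadata.json") as f:
        return json.load(f)

def accuracy(samples: list[dict], query_model) -> float:
    # query_model(image_path, question) -> str wraps the model call.
    correct = 0
    for s in samples:
        reply = query_model(s["image"], s["question"])
        # Answers are short and easily verifiable, so a normalized
        # string comparison is enough for scoring.
        correct += reply.strip().lower() == s["answer"].strip().lower()
    return correct / len(samples)
```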

Into the Future

Advancing both our multimodal understanding and generation capabilities is a key step in building a beneficial AGI that can comprehend the universe. In the coming months, we expect to make significant strides in both areas, enhancing capabilities across various modalities, including images, audio, and video.

If you want to be a part of this journey, we are hiring.