Last night, DeepSeek multimodal researcher Chen Xiaokang posted a tweet on X, unveiling a new paper on multimodal technology titled "Thinking with Visual Primitives," saying he was "Excited to release."
Early this morning, the tweet was deleted, and the paper was also pulled from GitHub.
But APPSO read the full text before it vanished. After reading it, we felt the paper's removal might not be due to problematic content.
On the contrary, it may have revealed too much.
Two days ago, we tested DeepSeek's image recognition mode. We asked it to count fingers; it thought for a while, complained to itself, "I'm really getting dizzy counting," and then got the answer wrong. At the time, it looked like a minor glitch of the beta testing phase.
This paper tells us that the "count fingers dizziness" incident hides a technical bottleneck that GPT, Claude, and Gemini have collectively failed to solve well.
And DeepSeek's solution is almost laughably simple: Give the AI a finger.
Chen Xiaokang wrote in that tweet:
"Traditional CoT stays in the linguistic space, but visual reasoning needs more. By using points and boxes as cognitive anchors, our model bridges the Reference Gap—mimicking the 'point-to-reason' synergy humans use."
Seeing Clearly and Pointing Accurately Are Two Different Things
Currently, all multimodal large models perform image reasoning by essentially converting the visual scene into language, then doing Chain-of-Thought reasoning in the linguistic space. GPT-5.4, Claude-Sonnet-4.6, Gemini-3-Flash—all follow this path.
Over the past two years, OpenAI, Google, and Anthropic have focused their improvement efforts on one problem: how to make the model see more clearly. Their answers include high-resolution cropping, dynamic tiling, and feeding in enlarged images. DeepSeek calls the problem these techniques address the Perception Gap.
But this paper points out another bottleneck: the Reference Gap. The model sees clearly, but during the reasoning process, it cannot precisely point to a specific thing in the image.
Think of it this way: in a tightly packed photo of 25 people, using language to describe "the person next to the person in the blue jersey, third row from the left" is inherently fuzzy. The model loses context while counting, forgetting who it just counted.
How do humans solve this problem? In the most primitive way possible: extend a finger, point, and count one by one.
A 284B Parameter Model, Equipped with a Finger
DeepSeek's plan: let the model directly output coordinates on the image during its thought process.
Imagine the model sees a picture with many people. Its chain of thought is no longer "I see a person in blue on the left," but "I see this person," accompanied by bounding box coordinates that circle the person. Every person counted gets a box; after circling, counting the boxes gives the answer.
There are two coordinate formats. One is a bounding box, a rectangle drawn around an object, suited to locating it. The other is a point, a mark on the image, suited to tracing paths and solving mazes. DeepSeek calls these two things "visual primitives," the smallest units of thought.
The key change is here: before, the model output coordinates as a final answer ("the target is here"). Now, coordinates are embedded in the thought process itself. Coordinates are marks on scratch paper, not the answer on the exam sheet.
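To make "coordinates as scratch-paper marks" concrete, here is a minimal sketch of pulling boxes out of a reasoning trace and counting them. The `<box>x1,y1,x2,y2</box>` markup and the function name are our own assumptions for illustration; this article does not give the paper's actual coordinate syntax.

```python
import re

# Hypothetical markup: the paper's real coordinate syntax is not given here.
BOX_PATTERN = re.compile(r"<box>(\d+),(\d+),(\d+),(\d+)</box>")

def extract_boxes(trace: str) -> list[tuple[int, ...]]:
    """Return every (x1, y1, x2, y2) box embedded in a reasoning trace."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(trace)]

# A made-up trace in which each "this person" is anchored by a box.
trace = (
    "This is a group photo. I see this person <box>12,40,88,210</box>, "
    "this person <box>95,38,170,212</box>, "
    "and this person <box>180,45,250,215</box>. "
    "Counting the boxes gives the answer."
)

boxes = extract_boxes(trace)
count = len(boxes)  # the answer is read off the marks, not off the prose
print(count, boxes)
```

The point of the sketch is that the count comes from the marks left during reasoning rather than from a verbal description that can drift.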
Compressing an Image 7,056-Fold, Then Still Accurately Counting the People Inside
The base model is DeepSeek-V4-Flash, a 284B-parameter MoE (Mixture-of-Experts) model. MoE means the model's "brain" is huge, but only a small fraction of it is activated for each query: just 13B parameters during inference. It's like a 100-person team where only 5 people are sent out on any given task.
On the visual encoder side, the image goes through three levels of compression. An analogy: you want to send a photo to a friend with slow internet. Step one, you cut the photo into small patches. Step two, every 3x3 block of patches is merged into one (3x3 compression). Step three, redundant information is further reduced during transmission (4x KV cache compression).
Actual numbers: a 756x756 pixel image, roughly 570,000 pixels, is compressed down to 81 information units. That is a compression ratio of 7,056x.
My first reaction seeing that number was: can it still see clearly? But the results in the paper show, yes it can. Not only can it see clearly, it can precisely count 25 people in an image.
For comparison: given an identical 800x800 image, Gemini-3-Flash consumes about 1100 tokens to represent it, Claude-Sonnet-4.6 about 870, and GPT-5.4 about 740. DeepSeek uses only about 90 information units in its final computation. Others use around a thousand grids to remember the image; DeepSeek uses 90, then spends the saved computational power entirely on "pointing."
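The compression arithmetic above can be checked in a few lines. One number here is our assumption: the article only states the 3x3 merge and the 4x KV cache factor, so we take a 14x14-pixel patch because 14·14 · 3·3 · 4 = 7,056 reproduces the stated ratio. Treat the patch size as a plausible reconstruction, not a confirmed detail of the paper.

```python
# Assumption: 14x14-pixel patches. Only the 3x3 merge and the 4x KV cache
# factor come from the article; 14 is chosen because 14*14 * 3*3 * 4 = 7056.
PATCH = 14
MERGE = 3
KV_FACTOR = 4

def pixels_per_unit() -> int:
    """Pixels represented by one final information unit."""
    return PATCH * PATCH * MERGE * MERGE * KV_FACTOR

def stage_counts(side: int) -> tuple[int, int, int]:
    """Unit counts after patching, 3x3 merging, and KV cache compression."""
    patches = (side // PATCH) ** 2
    merged = patches // (MERGE * MERGE)
    cached = merged // KV_FACTOR
    return patches, merged, cached

def visual_units(width: int, height: int) -> float:
    """Approximate final unit count for an arbitrary image size."""
    return width * height / pixels_per_unit()

print(pixels_per_unit())        # 7056
print(stage_counts(756))        # (2916, 324, 81)
print(visual_units(800, 800))   # a little over 90
```

Under this assumption, a 756x756 image goes from 2,916 patches to 324 merged units to 81 final units, and an 800x800 image lands just above 90, consistent with the figures quoted above.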
How 40 Million Training Data Points Were Accumulated
DeepSeek crawled all datasets tagged "object detection" from platforms like Hugging Face, which initially yielded 97,984 data sources.
Then two rounds of screening were done.
Round one checked label quality. An AI automatically audited for three types of problems: labels that are meaningless numeric IDs (category names like "0" or "1"), labels that are private entities ("MyRoommate"), and labels that are ambiguous abbreviations (like "OK" or "NG" in industrial inspection: an "OK" apple looks nothing like an "OK" circuit board, so the AI cannot learn anything from the label). This round cut 56%, leaving 43,141 sources.
Round two checked bounding box quality. Three criteria: too many missed labels (half the objects unlabeled), boxes drawn askew cutting off half the object, and boxes so large they cover the entire image (indicating the original data was image classification forced into detection format, with no localization info). Another 27% was cut, leaving 31,701 sources.
Finally, category-based sampling and deduplication produced over 40 million high-quality samples.
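As a quick sanity check on the funnel described above, the cut rates can be recomputed from the three source counts. The counts are from the article; the variable and function names are ours.

```python
# Source counts as reported in the article.
stages = [
    ("crawled", 97_984),
    ("after label audit", 43_141),
    ("after box audit", 31_701),
]

def cut_rates(stages: list[tuple[str, int]]) -> list[float]:
    """Percentage of sources removed at each step, relative to the previous step."""
    return [
        100 * (1 - after / before)
        for (_, before), (_, after) in zip(stages, stages[1:])
    ]

rates = cut_rates(stages)
for (name, _), rate in zip(stages[1:], rates):
    print(f"{name}: cut {rate:.1f}%")
```

The two rates come out at roughly 56% and 27% (the second rounds up from about 26.5%), matching the figures quoted for each round.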
DeepSeek chose to scale up box data first and supplement point data later. The reason is simple: asking the AI to draw a box has a near-unique correct answer (the rectangle that just encloses the object). Asking the AI to mark a point has no unique answer, because any spot on the object counts as correct, which makes the training signal too fuzzy. Moreover, a box inherently contains two points (top-left and bottom-right), so once the model can draw boxes, pointing is a lower-dimensional version of the same task.
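The difference in training signal can be sketched with two scoring functions. Intersection over Union (IoU) is a standard detection metric that we use here as a stand-in; the article does not say which metric the paper uses, and the example boxes are made up.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def point_hits(point: tuple[float, float], box: Box) -> bool:
    """A point is 'correct' if it lands anywhere inside the object's box."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]

gold = (10.0, 10.0, 50.0, 50.0)
loose = (0.0, 0.0, 80.0, 80.0)

# Box signal is graded: a sloppy box scores much lower than a tight one.
print(iou(gold, gold), iou(gold, loose))

# Point signal is flat: very different points are all equally "right".
print(point_hits((15.0, 15.0), gold), point_hits((45.0, 45.0), gold))
```

A box prediction can be ranked on a continuous scale, while two points at opposite corners of the object earn the same credit, which is one way to read "too fuzzy."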
How to Teach the "Pointing" Capability to the Model
The post-training strategy is "train separately, then merge."
DeepSeek first used bounding box data to train an expert model specialized in drawing boxes, then used point data to train an expert model specialized in marking points. The two were trained separately because the data volume wasn't large enough yet; mixing both capabilities could cause mutual interference.
Then, reinforcement learning was applied to each expert separately. How to judge if the model "drew the box correctly" or "followed the path correctly"? DeepSeek designed a multi-dimensional scoring system: Is the format correct (coordinate syntax valid)? Is the logic coherent (no self-contradiction in the thought process)? Is the answer accurate (deviation between the final result and the gold standard)?
Data selection for reinforcement learning also has its nuances: first, let the model attempt the same question N times. Questions answered correctly every time are too easy and have no training value; questions answered incorrectly every time are too hard to learn from. Only questions with a mix of correct and incorrect attempts are kept for training.
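The selection rule above amounts to a pass-rate filter. Below is a minimal sketch; the value N = 8 and the scripted attempt functions are our assumptions, standing in for real model rollouts.

```python
from typing import Callable, Iterable

def keep_for_rl(attempt: Callable[[], bool], n: int = 8) -> bool:
    """Keep a question only if n attempts are neither all correct nor all wrong."""
    correct = sum(attempt() for _ in range(n))
    return 0 < correct < n

def scripted(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Hypothetical stand-in for a model rollout: replays fixed outcomes."""
    it = iter(outcomes)
    return lambda: next(it)

questions = {
    "too_easy": scripted([True] * 8),
    "too_hard": scripted([False] * 8),
    "useful": scripted([True, False, True, True, False, False, True, False]),
}

kept = [name for name, attempt in questions.items() if keep_for_rl(attempt, n=8)]
print(kept)  # ['useful']
```

Only the question with mixed outcomes survives, since it is the only one where the model's own attempts disagree and there is something to learn from the difference.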
The final step merges the two expert capabilities into one model. The specific method: have the unified model learn by mimicking the outputs of the two expert teachers, similar to a student learning different subjects from two tutors simultaneously.
How It Counts After Being Given a Finger
Counting 25 People
Give the model a team photo of a soccer team, and ask, "How many people are in the picture?"
Thought process: First, it judges, "This is a team photo, must count everyone, including players and coaches." Then, it outputs 25 bounding box coordinates at once, circling each person. Next, it counts by row: 4 sitting in the front row + 9 in the middle row + 8 in the back row + 2 coaches on the left + 2 coaches on the right = 25.
"How many bears are on the ground?"
There are three bears in the picture. The model draws a box around each one and judges its position: first one, climbing vertically on a tree trunk, exclude; second one, walking at the edge of a rock, count; third one, among wood chips and dirt, count. Answer: 2.
It didn't just count three and subtract one. It made a "is this on the ground?" judgment for each bear, with each judgment anchored by a specific coordinate. It was genuinely checking one by one, not guessing.
Multi-hop Spatial Reasoning
A 3D rendered scene has a bunch of colorful geometric objects. Question: "Is there a purple rubber object the same size as the gray metal object?"
The model first boxes the gray metal sphere, confirming it's a small object. Then it proceeds to box other small objects in the scene one by one: brown metal cylinder, blue metal cube, blue rubber cube, yellow rubber cylinder... six objects checked in total, verifying three attributes—color, material, size—one by one. Conclusion: No purple rubber one exists.
Six localizations, six judgments. Every step is anchored by coordinates, eliminating the "wait, where was I?" problem.
Maze Navigation: While Others Flip Coins, DeepSeek Actually Searches
The paper tested four types of tasks, and the maze was where the performance gap widened the most.
The task is straightforward: given a maze image, determine if there is a path from start to finish, and if so, draw it. Three maze shapes were used: square grid, circular ring, and honeycomb.
The model navigates the maze just like you did as a kid using a pencil on paper: choose a path branch, follow it to the end, backtrack if it's a dead end, and try another. The difference is, every step it takes marks a coordinate point on the map, keeping a record.
The paper shows the full process for a circular maze: the model first marks the start and end points, then begins exploring. It took 18 steps, entered two dead ends in the middle and retreated, finally winding out a path, outputting the entire path's coordinate sequence.
DeepSeek also designed a batch of trap mazes: at first glance, a path seems to exist, but a segment in the middle is secretly blocked. This type of maze tests patience; the model can't just look at the trend near the entrance and conclude—it has to honestly try all traversable routes to confirm no path exists.
Accuracy comparison:
- DeepSeek: 66.9%
- GPT-5.4: 50.6%
- Claude-Sonnet-4.6: 48.9%
- Gemini-3-Flash: 49.4%
- Qwen3-VL: 49.6%
A maze question has only two answers: there is a path, or there isn't. Random guessing therefore lands at about 50%. GPT, Claude, Gemini, and Qwen all hover around 50%, indistinguishable from a coin toss. DeepSeek's 66.9% isn't high, but it shows the model is actually walking step by step rather than guessing.
Path Tracing: The Ultimate Version of "Spot the Difference"
This task is more intuitive: a bundle of tangled lines, each leading from one marker to another. The mess your headphone cable turns into in your pocket is what the image looks like. The question: Where does line C lead to?
The model's approach is to output coordinate points along the line, like sliding a finger across paper. Points are dense where the line bends sharply, and sparse on straight segments. The human eye tracks a line the same way: slowing down at curves, scanning quickly over straightaways.
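The "dense at bends, sparse on straights" pattern is what classic polyline simplification produces. As an illustration we use the Ramer-Douglas-Peucker algorithm, a standard technique that we are swapping in for demonstration; the article does not say how the paper's point sequences were generated. The test curve and tolerance are made up.

```python
import math

Point = tuple[float, float]

def _line_distance(p: Point, a: Point, b: Point) -> float:
    """Perpendicular distance from p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    length = math.hypot(bx - ax, by - ay)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / length

def simplify(points: list[Point], tolerance: float) -> list[Point]:
    """Ramer-Douglas-Peucker: keep only points needed to stay within tolerance."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, farthest = max(
        ((i, _line_distance(points[i], first, last)) for i in range(1, len(points) - 1)),
        key=lambda pair: pair[1],
    )
    if farthest <= tolerance:
        return [first, last]
    left = simplify(points[: index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right   # drop the duplicated split point

# A straight run along y = 0, followed by a tight quarter-circle bend.
straight = [(float(x), 0.0) for x in range(11)]
bend = [(10 + 2 * math.sin(math.radians(t)), 2 - 2 * math.cos(math.radians(t)))
        for t in range(10, 91, 10)]
line = straight + bend

kept = simplify(line, tolerance=0.05)
straight_kept = [p for p in kept if p[0] <= 10]
bend_kept = [p for p in kept if p[0] > 10]
print(len(line), len(kept), len(straight_kept), len(bend_kept))
```

Most of the surviving points sit on the bend, while the long straight run collapses to almost nothing, mirroring the finger that slows down at curves and sweeps across straightaways.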
The paper also added a harder test version: all lines have the same color and thickness. You can't rely on color to distinguish lines anymore; you must depend on the continuity of the curve's trajectory to judge which line to follow at intersections.
Accuracy comparison:
- DeepSeek: 56.7%
- GPT-5.4: 46.5%
- Claude-Sonnet-4.6: 30.6%
- Gemini-3-Flash: 41.4%
Claude's 30.6% is somewhat surprising. With four or five candidate endpoints per question, random guessing should yield 20% to 25%, so 30.6% is only marginally better than guessing blindly. Perhaps its linguistic reasoning inertia actually hindered it in this type of pure spatial tracking task.
How to Teach AI to Navigate Mazes Without Cheating
Maze training has a practical problem: if the score depends only on whether the final answer is correct, the model quickly learns to be cunning. Rather than bother searching and possibly getting it wrong, it can just guess. After all, a serious attempt that fails and a no-effort guess that misses both score zero, so effort earns nothing extra.
DeepSeek's solution is to factor the process into the score. Each legal exploratory step earns points; walking through walls deducts points. The farther it explores, the better. Even if it doesn't reach the end, as long as it thoroughly searches most of the area, it can get a decent score. This way, the model has no motivation to slack off.
The requirement for unsolvable mazes is even higher: you can't just say "no path"—you must prove you've indeed traversed all reachable areas. Search coverage is also scored.
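A process-based score of this kind can be sketched as follows. Everything numeric here is our assumption: the weights, the choice to reward only steps that reach a new cell (so that pacing back and forth earns nothing), and the tiny text grid are illustrative, not the paper's reward function.

```python
from collections import deque

Cell = tuple[int, int]  # (row, col)

def reachable_cells(grid: list[str], start: Cell) -> set[Cell]:
    """All open cells connected to start (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen

def process_reward(grid: list[str], steps: list[Cell], new_cell_bonus: float = 1.0,
                   wall_penalty: float = 2.0, coverage_weight: float = 5.0) -> float:
    """Score a trace of steps; all weights are illustrative assumptions."""
    score = 0.0
    explored = {steps[0]}
    for (r0, c0), (r1, c1) in zip(steps, steps[1:]):
        adjacent = abs(r0 - r1) + abs(c0 - c1) == 1
        inside = 0 <= r1 < len(grid) and 0 <= c1 < len(grid[0])
        if adjacent and inside and grid[r1][c1] != "#":
            if (r1, c1) not in explored:
                score += new_cell_bonus      # legal step into new territory
                explored.add((r1, c1))
        else:
            score -= wall_penalty            # walked through a wall or jumped
    reachable = reachable_cells(grid, steps[0])
    coverage = len(explored & reachable) / len(reachable)
    return score + coverage_weight * coverage

trap = ["S.#",
        ".##",
        ".#E"]   # unsolvable: the end is walled off

thorough = [(0, 0), (0, 1), (0, 0), (1, 0), (2, 0)]
lazy = [(0, 0)]
cheater = [(0, 0), (0, 1), (0, 2)]   # steps into a wall

print(process_reward(trap, thorough), process_reward(trap, lazy), process_reward(trap, cheater))
```

On this unsolvable grid, the trace that visits every reachable cell scores far above both the no-effort trace and the one that walks through a wall, so "no path" only pays off when it is backed by a full search.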
One Easter Egg, Three Limitations
The post-training data contained no Chinese. Yet the model can perform visual primitive reasoning in Chinese.
Given a photo of a coffee machine and asked in Chinese, "How to make a latte," it annotates the coordinates of the steam wand, milk pitcher, coffee beans, and latte button in Chinese, then provides operational steps. Multilingual ability is inherited from the base model; the visual primitive training did not destroy it.
It can also combine visual perception with world knowledge: shown a photo of the Golden Gate Bridge and asked, "Is there an NBA team nearby?" It first boxes the Golden Gate Bridge, infers it's San Francisco, and answers the Golden State Warriors.
It can understand humor: natural spots on a fruit cross-section happened to form a melancholic cat face. The model pointed out where the resemblance lay and explained why it was funny.
It can provide escape room guidance: it boxes the key on a high shelf, a chair on the floor, and a locked door, suggesting: "Move the chair under the key → stand on it to get the key → go unlock the door."
The paper very candidly wrote down current limitations.
Input resolution is limited. ViT output is restricted to between 81 and 384 visual information units. For very fine-grained scenes (like counting fingers), the coordinate precision is insufficient. This might be the direct reason for the finger-counting failure in our test the day before yesterday.
Currently, specific trigger words are needed to activate the visual primitive mode. The model cannot yet autonomously decide "I should use my finger for this problem"—someone has to remind it.
Generalization of topological reasoning is limited. Performance is good on trained maze types, but it can fail when encountering a new spatial structure. Chen Xiaokang also stated in that deleted tweet:
"We're still in the early stages; generalization in complex topological reasoning tasks isn't perfect yet, but we're committed to solving it."
The capabilities shown by DeepSeek's image recognition mode in our test the day before yesterday (inquiring about the publisher's identity, associating the meaning of the whale logo, self-correcting, holding a "mini defense meeting" with itself) are of the same lineage as the thinking method described in this paper. It builds visual anchors in its mind, reasons around them, and backtracks to correct contradictions when encountered.
The finger-counting dizziness was a live demonstration of the Reference Gap. In an image with overlapping fingers, relying purely on language descriptions to distinguish "the third finger from the left" from "the second finger from the right" is bound to cause confusion, just as it would if you tried to count a crowded group without pointing with your own finger.
The direction this paper points to is that the next evolution in multimodal reasoning lies in anchoring mechanisms. With only about 90 information units, DeepSeek matched results for which others spend roughly a thousand tokens, and it put the saved computational power entirely into letting the model "think while pointing."
The resolution arms race can take a breather. Teaching the model to point its finger is more effective than buying it a more expensive pair of glasses.
This whale, after opening its eyes, has grown a finger. 66.9% maze accuracy is far from perfect, but at least it's diligently walking, unlike the neighbors who are flipping coins.