News
We evaluate our Qwen2.5-Math base models on three widely used English math benchmarks: GSM8K, MATH, and MMLU-STEM. We also evaluate on three Chinese math benchmarks: CMATH, GaoKao Math Cloze, ...
Problem #1: Most browser agents draw numbered boxes around page elements, an approach that does not generalize well to complex modern sites. Solution: a vision-first architecture. A visually grounded LLM specifies ...