Large language models are rubbish at elementary level math

“9.11 and 9.9, which one is bigger?” Questions as simple as this confuse large language models including OpenAI’s GPT-4o, Moonshot’s Kimi, and ByteDance’s Doubao, according to a report by Chinese media outlet Yicai. Chatbots from China’s Baidu and Tencent both gave the correct answer, though by different methods: Baidu’s chatbot compared the fractional parts after concluding that the integer parts are the same, while Tencent’s Hunyuan concluded that 9.9 is the bigger number by computing that 9.11 minus 9.9 is negative. ChatGPT and Kimi, which both answered the first prompt incorrectly, got it right after users clarified that they meant “in terms of numerical value.” AI-powered chatbots are trained on internet data to chat with humans in a natural way so that they can perform text-based, knowledge-based tasks. [Yicai, in Chinese]
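
For readers who want to see the two correct approaches spelled out, here is a minimal Python sketch. It is an illustration only, not the chatbots' actual reasoning or code: the function names `compare_by_parts` and `compare_by_subtraction` are hypothetical, and the sketch assumes non-negative decimal numbers passed in as strings.

```python
from decimal import Decimal


def compare_by_parts(a: str, b: str) -> str:
    """Return the larger number by comparing integer parts, then fractional parts."""
    a_int, _, a_frac = a.partition(".")
    b_int, _, b_frac = b.partition(".")

    # Step 1: if the integer parts differ, they decide the answer.
    if int(a_int) != int(b_int):
        return a if int(a_int) > int(b_int) else b

    # Step 2: pad the fractional parts to the same length (".9" becomes ".90")
    # so that they can be compared digit by digit.
    width = max(len(a_frac), len(b_frac))
    a_pad = a_frac.ljust(width, "0")
    b_pad = b_frac.ljust(width, "0")
    return a if a_pad >= b_pad else b


def compare_by_subtraction(a: str, b: str) -> str:
    """Return the larger number by checking the sign of a minus b."""
    # Decimal avoids binary floating-point rounding in the subtraction.
    difference = Decimal(a) - Decimal(b)
    return b if difference < 0 else a


print(compare_by_parts("9.11", "9.9"))
print(compare_by_subtraction("9.11", "9.9"))
```

Both functions pick 9.9. The padding step is what matters in the first method: treating the fractional parts as the whole numbers 11 and 9, without padding, would wrongly favor 9.11.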

Large language models are rubbish at elementary level math · TechNode