A new way to let AI chatbots converse all day without crashing
Credit: Christine Daniloff, MIT

When a human-AI conversation involves many rounds of continuous dialogue, the powerful large language machine-learning models that drive chatbots like ChatGPT sometimes start to collapse, causing the bots' performance to rapidly deteriorate.

A team of researchers from MIT and elsewhere has pinpointed a surprising cause of this problem and developed a simple solution that enables a chatbot to maintain a nonstop conversation without crashing or slowing down.

Their method involves a tweak to the key-value cache (which is like a conversation memory) at the core of many large language models. In some methods, when this cache needs to hold more information than it has capacity for, the first pieces of data are bumped out. This can cause the model to fail.

By ensuring that these first few pieces of data remain in memory, the researchers' method allows a chatbot to keep chatting no matter how long the conversation goes.

The method, called StreamingLLM, enables a model to remain efficient even when a conversation stretches on for more than 4 million words. When compared to another method that avoids crashing by constantly recomputing part of the past conversations, StreamingLLM performed more than 22 times faster.

This could allow a chatbot to conduct long conversations throughout the workday without needing to be continually rebooted, enabling efficient AI assistants for tasks like copywriting, editing, or generating code.

"Now, with this method, we can persistently deploy these large language models. By making a chatbot that we can always chat with, and that can always respond to us based on our recent conversations, we could use these chatbots in some ," says Guangxuan Xiao, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on StreamingLLM now posted to the arXiv preprint server.

Xiao's co-authors include his advisor, Song Han, an associate professor in EECS, a member of the MIT-IBM Watson AI Lab, and a distinguished scientist of NVIDIA; as well as Yuandong Tian, a research scientist at Meta AI; Beidi Chen, an assistant professor at Carnegie Mellon University; and senior author Mike Lewis, a research scientist at Meta AI. The work will be presented at the International Conference on Learning Representations held May 7–11 in Vienna.

A puzzling phenomenon

Large language models encode data, like words in a user query, into representations called tokens. Many models employ what is known as an attention mechanism that uses these tokens to generate new text.

Typically, an AI chatbot writes new text based on text it has just seen, so it stores recent tokens in memory, called a KV Cache, to use later. The attention mechanism builds a grid that includes all tokens in the cache, an "attention map" that maps out how strongly each token, or word, relates to each other token.
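To make the idea of an attention map concrete, here is a minimal sketch in Python. It assumes the standard scaled dot-product form of attention with a softmax over each row, plus causal masking so a token only looks at tokens before it; the article does not spell out these details, and the function name and toy vectors are made up for illustration.

```python
import math

def attention_map(queries, keys):
    """Return a square grid of attention weights.

    Row i holds how strongly token i attends to each token j <= i.
    Assumption: scaled dot-product scores, softmax per row, causal mask.
    """
    dim = len(keys[0])
    grid = []
    for i, q in enumerate(queries):
        # Score token i against every token seen so far.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys[: i + 1]]
        # Softmax turns the scores into weights that sum to 1.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        # Later (masked) positions get zero weight so the grid stays square.
        grid.append(row + [0.0] * (len(keys) - len(row)))
    return grid

# Toy two-dimensional vectors for three tokens (made-up numbers).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
grid = attention_map(vectors, vectors)
for row in grid:
    print([round(w, 3) for w in row])
```

Each row of the grid sums to 1, and the first row puts all of its weight on the first token, since that is the only token it can see.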

Understanding these relationships is one feature that enables large language models to generate human-like text.

But when the cache gets very large, the attention map can become even more massive, which slows down computation.

Also, if encoding content requires more tokens than the cache can hold, the model's performance drops. For instance, one popular model can store 4,096 tokens, yet there are about 10,000 tokens in an academic paper.
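A quick back-of-the-envelope calculation shows the scale of the problem. The token counts below come from the figures above; the assumption that the attention map holds one entry per pair of tokens is a simplification for illustration.

```python
# Figures from the article: cache capacity and paper length, in tokens.
cache_capacity = 4096
paper_tokens = 10_000

# Tokens that cannot fit in the cache at the same time.
overflow = paper_tokens - cache_capacity

# Assumption: the attention map is a square grid, one entry per token pair.
full_cache_entries = cache_capacity ** 2
whole_paper_entries = paper_tokens ** 2

print(overflow)             # 5904
print(full_cache_entries)   # 16777216
print(whole_paper_entries)  # 100000000
```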

To get around these problems, researchers employ a "sliding cache" that bumps out the oldest tokens to add new tokens. However, the model's performance often plummets as soon as that first token is evicted, rapidly reducing the quality of newly generated words.
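The eviction behavior can be sketched in a few lines of Python. This is only an illustration of the sliding idea, with integers standing in for tokens; a real KV cache stores key and value tensors, and the function name here is hypothetical.

```python
from collections import deque

def sliding_cache(tokens, capacity):
    """Keep only the most recent `capacity` tokens."""
    # A deque with maxlen drops its oldest item when a new one is appended.
    cache = deque(maxlen=capacity)
    for token in tokens:
        cache.append(token)
    return list(cache)

# Ten tokens pass through a cache that can hold four.
print(sliding_cache(range(10), 4))  # [6, 7, 8, 9]
```

Token 0, the first token, is gone as soon as the fifth token arrives, which is the point at which the article says performance tends to plummet.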

In this new paper, researchers realized that if they keep the first token in the sliding cache, the model will maintain its performance even when the cache size is exceeded.
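A minimal sketch of that change, under the same simplifications (integers stand in for tokens, and the function and parameter names are hypothetical rather than taken from the researchers' code): the first token is pinned in place, and only the remaining slots slide.

```python
from collections import deque

def pinned_sliding_cache(tokens, capacity, num_pinned=1):
    """Keep the first `num_pinned` tokens plus the most recent ones."""
    if not 0 <= num_pinned <= capacity:
        raise ValueError("num_pinned must be between 0 and capacity")
    pinned = []
    # Only the slots left over after pinning are allowed to slide.
    recent = deque(maxlen=capacity - num_pinned)
    for token in tokens:
        if len(pinned) < num_pinned:
            pinned.append(token)   # never evicted
        else:
            recent.append(token)   # oldest recent token is dropped when full
    return pinned + list(recent)

# Ten tokens, four slots: token 0 stays, the other three slots slide.
print(pinned_sliding_cache(range(10), 4))  # [0, 7, 8, 9]
```

The total cache size is unchanged; the only difference from a plain sliding cache is which tokens are allowed to be evicted.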

But this didn't make any sense. The first word in a novel likely has nothing to do with the last word, so why would the first word be so important for the model to generate the newest word?

In their new paper, the researchers also uncovered the cause of this phenomenon.

More information: Guangxuan Xiao et al, Efficient Streaming Language Models with Attention Sinks, arXiv (2023). DOI: 10.48550/arxiv.2309.17453

This story is republished courtesy of MIT News (web.mit.edu/newsoffice/), a popular site that covers news about MIT research, innovation and teaching.

Citation: A new way to let AI chatbots converse all day without crashing (2024, February 13) retrieved 13 February 2024 from https://techxplore.com/news/2024-02-ai-chatbots-converse-day.html
