When AI labs update their underlying large language models, the new versions often behave unexpectedly, such as responding to the same queries in a completely different way. Apple researchers have developed new ways to preserve the user experience when a familiar AI model is upgraded.
In their paper, the Apple researchers note that users develop their own systems for interacting with LLMs, including prompting styles and techniques. Switching to a new model can be a draining process that degrades the experience of using an AI model.
As a result of an update, users may be forced to change the way they write prompts. Early adopters of ChatGPT may accept this, but mainstream users on iOS will likely find it unacceptable.
To address this issue, the team devised a metric to compare regressions and inconsistencies between different model versions, and also developed a training strategy to keep those inconsistencies from arising in the first place.
It is not clear whether this work will become part of Apple Intelligence in a future version of iOS, but it does suggest that Apple is preparing for future updates to its underlying models, with the goal of ensuring that Siri responds to the same queries in the same way.
The researchers state that their new method reduces the number of negative flips, cases in which the old model gives the correct answer but the new model gets it wrong, by up to 40%.
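To make the idea concrete, here is a minimal sketch of how such a regression metric can be computed. The function name and the toy data are illustrative, not taken from the paper.

```python
# Hedged sketch of a regression metric in the spirit of the paper: count
# "negative flips", i.e. test items the old model answered correctly but
# the new model gets wrong. Names and data here are illustrative only.

def negative_flip_rate(old_preds, new_preds, gold):
    """Fraction of items where the old model was right and the new one is wrong."""
    assert len(old_preds) == len(new_preds) == len(gold)
    flips = sum(
        old == ref and new != ref
        for old, new, ref in zip(old_preds, new_preds, gold)
    )
    return flips / len(gold)

# Toy math-question example: the update regresses on one of four items.
gold      = ["4", "9", "16", "25"]
old_preds = ["4", "9", "16", "24"]  # old model: 3/4 correct
new_preds = ["4", "8", "16", "24"]  # new model: flips the second item
print(negative_flip_rate(old_preds, new_preds, gold))  # 0.25
```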
The authors of the paper also argue that when the new model does make mistakes, those mistakes should, where possible, match the mistakes made by the old model.
“We argue that there is value in being consistent when both models are wrong,” they said, adding that “users may develop coping strategies for how they interact with their models when they are wrong.” Inconsistency, in other words, leads to user dissatisfaction.
They call the method they developed MUSCLE (an acronym for Model Update Strategy for Compatible LLM Evolution). It does not require changing the training of the base model; instead, it relies on training small plug-in modules, which they call compatibility adapters.
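The following is a hedged sketch of what a plug-in adapter of this kind might look like in practice. It uses Hugging Face's peft library with a LoRA adapter as a stand-in; the paper's actual adapter architecture and training objective may differ, and the checkpoint name is a placeholder.

```python
# Hedged sketch of the "compatibility adapter" idea: freeze the updated base
# model and train only a small plug-in adapter toward consistency with the
# old model. LoRA via peft is a stand-in, not the paper's exact method.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("new-model-checkpoint")  # placeholder name

# Only the adapter's low-rank weights receive gradients; the base model
# update itself is left untouched, matching the plug-in framing above.
adapter_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, adapter_cfg)
model.print_trainable_parameters()

# A training loop would then fine-tune the adapter so the updated model's
# outputs stay consistent with the old model's (e.g., a distillation-style loss).
```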
To test whether the system works, the research team updated LLMs such as Llama and Phi. Their testing included asking the updated models math questions to see whether answers that had previously been correct stayed correct.
With the proposed MUSCLE system, the researchers say, they were able to reduce these negative flips considerably, in some cases by as much as 40%.
Given the fast pace at which chatbots like ChatGPT and Google's Gemini are being updated, Apple's research has the potential to make new versions of these tools more reliable. It would be a shame if users had to trade up to a newer model only to suffer a worse user experience.