

Why AI needs to learn new languages
OpenAI has not revealed much about how GPT-4 was built. But a look at its predecessor, GPT-3, is suggestive. Large language models (LLMs) are trained on text scraped from the internet, where English is the lingua franca. Around 93% of GPT-3’s training data was in English. In Common Crawl, just one of the datasets on which the model was trained, English makes up 47% of the corpus, with other (mostly related) European languages accounting for 38% more. Chinese and Japanese combined, by contrast, make up just 9%. Telugu was not even a rounding error.
An evaluation by Nathaniel Robinson, a researcher at Johns Hopkins University, and his colleagues finds that this problem is not limited to ChatGPT. All LLMs fare better with “high-resource” languages, for which training data are plentiful, than with “low-resource” ones, for which they are scarce. That is a problem for those hoping to export AI to poor countries, in the hope it might improve everything from schools to health care. Researchers around the world are therefore working to make AI more multilingual.
India’s government is particularly keen. Many of its public services are already digitised, and it wants to fortify them with AI. In September, for instance, it launched a chatbot to help farmers get information about state benefits.
The bot works by welding two sorts of language model together, says Shankar Maruwada of the EkStep Foundation, a non-profit that helped build it. Users can submit queries in their native tongues. (Eight are supported so far; five more are coming soon.) These are passed to a piece of machine-translation software developed at IIT Madras, an Indian academic institution, which translates them into English. The English version of the question is then fed to the LLM, and its response translated back into the user’s mother tongue.
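The translate-then-ask flow described above can be sketched in a few lines of Python. This is only an illustration of the three steps, not the bot's actual code: the function names, the signatures and the toy stand-ins for the translator and the LLM are all hypothetical.

```python
from typing import Callable

# Hypothetical signatures, assumed for this sketch only:
# a translator maps (text, language_code) -> text, an LLM maps prompt -> reply.
Translator = Callable[[str, str], str]
LanguageModel = Callable[[str], str]


def answer_in_native_language(
    query: str,
    lang: str,
    to_english: Translator,
    llm: LanguageModel,
    from_english: Translator,
) -> str:
    """Translate the query into English, ask the LLM, translate the reply back."""
    english_query = to_english(query, lang)      # step 1: native tongue -> English
    english_answer = llm(english_query)          # step 2: the LLM answers in English
    return from_english(english_answer, lang)    # step 3: English -> native tongue


# Toy stand-ins that only tag the text, so the data flow is visible.
def toy_to_english(text: str, lang: str) -> str:
    return f"[{lang}->en] {text}"


def toy_llm(prompt: str) -> str:
    return f"answer({prompt})"


def toy_from_english(text: str, lang: str) -> str:
    return f"[en->{lang}] {text}"


reply = answer_in_native_language(
    "question", "te", toy_to_english, toy_llm, toy_from_english
)
print(reply)
```

Each query passes through two translation steps as well as the LLM itself, so a mistake in any of the three stages carries through to the final reply.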
The system seems to work. But translating queries into an LLM’s preferred language is a rather clumsy workaround. After all, language is a vehicle for worldviews and culture as well as just meaning, notes the boss of one Indian AI firm. A paper by Rebecca Johnson, a researcher at the University of Sydney, published in 2022, found that GPT-3 gave replies on topics such as gun control and refugee policy that aligned most with the values displayed by Americans in the World Values Survey, a global questionnaire of public opinion.
Many researchers are therefore trying to make LLMs themselves more fluent in less widely spoken languages. One approach is to modify the tokeniser, the part of an LLM that chops words into smaller chunks for the rest of the model to manipulate. Text in Devanagari, a script used with Hindi, needs three to four times more tokens, when tokenised the standard way, than the same text in English. An Indian startup called Sarvam AI has written a tokeniser optimised for Hindi, which cuts that number substantially. Fewer tokens means fewer computations. Sarvam reckons that OpenHathi, its Devanagari-optimised LLM, can cut the cost of answering questions by around three-quarters.
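A rough sketch in Python suggests why the token count matters. It rests on two assumptions that are not from the article: that the tokeniser starts from UTF-8 bytes, in which each Devanagari character takes three bytes against one for a basic Latin letter, and that the cost of an answer scales linearly with the number of tokens. The helper names and the 4-to-1 token ratio are illustrative, not Sarvam's figures.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    return len(text.encode("utf-8")) / len(text)


def relative_cost_saving(tokens_before: float, tokens_after: float) -> float:
    """Fraction of cost saved, ASSUMING cost is proportional to token count."""
    return 1 - tokens_after / tokens_before


# "namaste" in Devanagari, written as escapes: six code points in the
# U+0900 block, each of which needs three bytes in UTF-8.
hindi_word = "\u0928\u092e\u0938\u094d\u0924\u0947"
english_word = "hello"

print(utf8_bytes_per_char(hindi_word))    # 3.0
print(utf8_bytes_per_char(english_word))  # 1.0

# Illustrative only: if a text that needed four tokens per English token
# could be brought down to parity, a linear cost model would save 75%.
print(relative_cost_saving(4, 1))         # 0.75
```

Under the linear assumption, moving from four times the English token count to parity would save three-quarters of the cost, which is the same order as the saving Sarvam claims. Real costs depend on more than the token count.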
Another is to improve the datasets on which LLMs are trained. Often this means digitising reams of pen-and-paper texts. In November a team of researchers at Mohamed bin Zayed University of Artificial Intelligence, in Abu Dhabi, released the latest version of an Arabic-speaking model called “Jais”. It has one-sixth as many parameters (one measure of a model’s size) as GPT-3, but performs on par with it in Arabic. Timothy Baldwin, the university’s acting provost, notes that, because his team could only digitise so much Arabic text, the model also included some English. Some concepts, after all, are similar across all languages, and can be learned in any tongue. Data in a specific language are more important for teaching the model specific cultural ideas and quirks.
The third approach is to tweak models after they have been trained. Both Jais and OpenHathi have had some question-and-answer pairs hand crafted by humans. The same happens with Western chatbots, to stop them spreading what their makers see as disinformation. Ernie Bot, an LLM from Baidu, a big Chinese tech company, has been tweaked to try to stop it saying things to which the government might object. Models can also learn from human feedback, in which users rate an LLM’s answers. But that is hard to do for many poor-world languages, says Dr Baldwin, since it requires recruiting people literate enough to criticise the machine’s writing.
How well all this will work remains to be seen. A quarter of India’s adults are illiterate, something that no amount of LLM tweaking will solve. Many Indians prefer using voice messages to communicate rather than text ones. AI can also turn speech into words, as India’s chatbot for farmers does. But that adds another step at which errors can creep in.
And it is possible that builders of local LLMs may eventually be put out of business by the efforts of the Silicon Valley big boys. Although it is far from perfect, GPT-4 is much better than GPT-3 at answering questions in non-English languages. However it is done, teaching AI to speak more of the world’s 7,000-odd languages can only be a good thing.
© 2024, The Economist Newspaper Limited. All rights reserved. From The Economist, published under licence. The original content can be found on www.economist.com
Published: 26 Mar 2024, 03:12 PM IST