Why AI needs to learn new languages
OpenAI has not revealed much about how GPT-4 was built. But a look at its predecessor, GPT-3, is suggestive. Large language models (LLMs) are trained on text scraped from the internet, on which English is the lingua franca. Around 93% of GPT-3’s training data was in English. In Common Crawl, just one of the datasets on which the model was trained, English makes up 47% of the corpus, with other (mostly related) European languages accounting for 38% more. Chinese and Japanese combined, by contrast, make up just 9%. Telugu was not even a rounding error.
An evaluation by Nathaniel Robinson, a researcher at Johns Hopkins University, and his colleagues finds that this is not a problem limited to ChatGPT. All LLMs fare better with “high-resource” languages, for which training data are plentiful, than with “low-resource” ones, for which they are scarce. That is a problem for those who want to export AI to poor countries in the hope that it might improve everything from schools to health care. Researchers around the world are therefore working to make AI more multilingual.
India’s government is particularly keen. Many of its public services are already digitised, and it wants to fortify them with AI. In September, for instance, it launched a chatbot to help farmers get information about state benefits.
The bot works by welding two sorts of language model together, says Shankar Maruwada of the EkStep Foundation, a non-profit that helped build it. Users can submit queries in their native tongues. (Eight are supported so far; five more are coming soon.) These are passed to a piece of machine-translation software developed at IIT Madras, an Indian academic institution, which translates them into English. The English version of the question is then fed to the LLM, and its response translated back into the user’s mother tongue.
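The translate-ask-translate flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the callable signatures and the toy stand-ins for the translator and the LLM are all hypothetical, and nothing here reflects the actual software built at IIT Madras or by the EkStep Foundation.

```python
from typing import Callable

# Hypothetical interfaces, assumed for illustration only.
# A translator takes (text, source_language, target_language) and returns text.
Translator = Callable[[str, str, str], str]
# A language model takes an English prompt and returns an English answer.
LanguageModel = Callable[[str], str]


def answer_in_native_language(
    query: str,
    user_lang: str,
    translate: Translator,
    ask_llm: LanguageModel,
    pivot_lang: str = "en",
) -> str:
    """Route a native-language query through an English-centric LLM."""
    # Step 1: translate the user's query into the LLM's preferred language.
    pivot_query = translate(query, user_lang, pivot_lang)
    # Step 2: let the LLM answer in that language.
    pivot_answer = ask_llm(pivot_query)
    # Step 3: translate the answer back into the user's language.
    return translate(pivot_answer, pivot_lang, user_lang)


# Toy stand-ins so the sketch runs without any external service.
def toy_translate(text: str, source_lang: str, target_lang: str) -> str:
    return f"[{source_lang}->{target_lang}] {text}"


def toy_llm(prompt: str) -> str:
    return f"answer to: {prompt}"


reply = answer_in_native_language("q", "te", toy_translate, toy_llm)
print(reply)
```

Each step is a separate component, so any of them can be swapped out, but each is also a place where meaning can be lost before the answer reaches the user.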
The system seems to work. But translating queries into an LLM’s preferred language is a rather clumsy workaround. After all, language is a vehicle for worldviews and culture as well as meaning, notes the boss of one Indian AI firm. A paper by Rebecca Johnson, a researcher at the University of Sydney, published in 2022, found that GPT-3 gave replies on topics such as gun control and refugee policy that aligned most with the values displayed by Americans in the World Values Survey, a global questionnaire of public opinion.
Many researchers are therefore trying to make LLMs themselves more fluent in less widely spoken languages. One approach is to modify the tokeniser, the part of an LLM that chops words into smaller chunks for the rest of the model to manipulate. Text in Devanagari, a script used with Hindi, needs three to four times as many tokens, when tokenised the standard way, as the same text in English. An Indian startup called Sarvam AI has written a tokeniser optimised for Hindi, which cuts that number substantially. Fewer tokens mean fewer computations. Sarvam reckons that OpenHathi, its Devanagari-optimised LLM, can cut the cost of answering questions by around three-quarters.
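A rough Python sketch shows why script matters to a tokeniser and how a token reduction translates into a cost saving. It rests on two simplifying assumptions that are not from the article: that a tokeniser with no learned merges for a script falls back to one token per UTF-8 byte (real tokenisers usually do better than this worst case), and that the cost of an answer scales linearly with the number of tokens. The function names are hypothetical, and this is not a description of Sarvam AI's tokeniser.

```python
def byte_fallback_token_count(text: str) -> int:
    """Worst-case token count, assuming every UTF-8 byte becomes one token."""
    return len(text.encode("utf-8"))


def cost_saving(token_reduction_factor: float) -> float:
    """Fraction of cost saved if cost is proportional to token count.

    A factor of 4 means the new tokeniser needs one-quarter as many tokens.
    """
    return 1.0 - 1.0 / token_reduction_factor


english_word = "water"
hindi_word = "पानी"  # "water" in Hindi, written in Devanagari

# ASCII letters take one byte each in UTF-8; Devanagari code points take three.
print(english_word, byte_fallback_token_count(english_word))
print(hindi_word, byte_fallback_token_count(hindi_word))

# Under the linear-cost assumption, needing a quarter as many tokens
# would save three-quarters of the cost.
print(cost_saving(4.0))
```

On these assumptions, a fourfold reduction in tokens is what a saving of around three-quarters would correspond to.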
Another is to improve the datasets on which LLMs are trained. Often this means digitising reams of pen-and-paper texts. In November a team of researchers at Mohamed bin Zayed University, in Abu Dhabi, released the latest version of an Arabic-speaking model called “Jais”. It has one-sixth as many parameters (one measure of a model’s size) as GPT-3, but performs on par with it in Arabic. Timothy Baldwin, the university’s acting provost, notes that, because his team could only digitise so much Arabic text, the model’s training data also included some English. Some concepts, after all, are similar across all languages, and can be learned in any tongue. Data in a specific language are more important for teaching the model specific cultural ideas and quirks.
The third approach is to tweak models after they have been trained. Both Jais and OpenHathi have been tuned with question-and-answer pairs hand-crafted by humans. The same happens with Western chatbots, to stop them spreading what their makers see as disinformation. Ernie Bot, an LLM from Baidu, a big Chinese tech company, has been tweaked to try to stop it saying things to which the government might object. Models can also learn from human feedback, in which users rate an LLM’s answers. But that is hard to do for many poor-world languages, says Dr Baldwin, since it requires recruiting people literate enough to criticise the machine’s writing.
How well all this will work remains to be seen. A quarter of India’s adults are illiterate, something that no amount of LLM tweaking will solve. Many Indians prefer to communicate by voice message rather than by text. AI can also turn speech into text, as India’s chatbot for farmers does. But that adds another step at which errors can creep in.
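The worry about extra steps can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that each stage of a pipeline succeeds independently with some fixed probability, so the chance of an error-free result is the product of the stage accuracies. The 95% figures are invented and are not measurements of any real system.

```python
import math


def end_to_end_accuracy(stage_accuracies: list[float]) -> float:
    """Chance that every stage succeeds, assuming independent errors."""
    return math.prod(stage_accuracies)


# Hypothetical stages: translate in, LLM answers, translate out.
text_only = end_to_end_accuracy([0.95, 0.95, 0.95])

# The same pipeline with a speech-to-text step added in front.
with_speech = end_to_end_accuracy([0.95, 0.95, 0.95, 0.95])

print(round(text_only, 4))
print(round(with_speech, 4))
```

Even when every stage is individually reliable, each additional one lowers the chance that the whole chain produces a correct answer.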
And it is possible that builders of local LLMs may eventually be put out of business by the efforts of the Silicon Valley big boys. Although it is far from perfect, GPT-4 is much better than GPT-3 at answering questions in non-English languages. However it is done, teaching AI to speak more of the world’s 7,000-odd languages can only be a good thing.
© 2024, The Economist Newspaper Limited. All rights reserved. From The Economist, published under licence. The original content can be found on www.economist.com
Published: 26 Mar 2024, 03:12 PM IST
