

Why AI needs to learn new languages
OpenAI has not revealed much about how GPT-4 was built. But a look at its predecessor, GPT-3, is suggestive. Large language models (LLMs) are trained on text scraped from the internet, where English is the lingua franca. Around 93% of GPT-3’s training data was in English. In Common Crawl, just one of the datasets on which the model was trained, English makes up 47% of the corpus, with other (mostly related) European languages accounting for 38% more. Chinese and Japanese combined, by contrast, make up just 9%. Telugu was not even a rounding error.
An evaluation by Nathaniel Robinson, a researcher at Johns Hopkins University, and his colleagues finds that this problem is not limited to ChatGPT. All LLMs fare better with “high-resource” languages, for which training data are plentiful, than with “low-resource” ones, for which they are scarce. That is a problem for those who want to export AI to poor countries in the hope that it might improve everything from schools to health care. Researchers around the world are therefore working to make AI more multilingual.
India’s government is particularly keen. Many of its public services are already digitised, and it wants to fortify them with AI. In September, for instance, it launched a chatbot to help farmers get information about state benefits.
The bot works by welding two sorts of language model together, says Shankar Maruwada of the EkStep Foundation, a non-profit that helped build it. Users can submit queries in their native tongues. (Eight are supported so far; five more are coming soon.) These are passed to a piece of machine-translation software developed at IIT Madras, an Indian academic institution, which translates them into English. The English version of the question is then fed to the LLM, and its response translated back into the user’s mother tongue.
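The relay described above (translate in, ask, translate out) can be sketched in a few lines of Python. This is a minimal illustration and not the chatbot's actual code: the function `answer_in_native_language`, the toy dictionaries and `toy_llm` are all hypothetical stand-ins, and a real system would call a machine-translation service and an LLM in their place.

```python
from typing import Callable


def answer_in_native_language(
    query: str,
    to_english: Callable[[str], str],
    ask_llm: Callable[[str], str],
    from_english: Callable[[str], str],
) -> str:
    """Relay a native-language query through an English-speaking LLM."""
    english_query = to_english(query)        # step 1: translate the query into English
    english_answer = ask_llm(english_query)  # step 2: the LLM answers in English
    return from_english(english_answer)      # step 3: translate the answer back


# Toy stand-ins (assumptions for illustration only) so the sketch runs.
HI_TO_EN = {"मौसम कैसा है?": "How is the weather?"}
EN_TO_HI = {"It is sunny.": "धूप खिली है।"}


def toy_llm(prompt: str) -> str:
    # A real LLM call would go here.
    return "It is sunny." if "weather" in prompt else "I do not know."


reply = answer_in_native_language(
    "मौसम कैसा है?", HI_TO_EN.__getitem__, toy_llm, EN_TO_HI.__getitem__
)
print(reply)
```

Every hop in the chain is a separate component, so a mistranslation at either end is passed along unchanged to the next step.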
The system seems to work. But translating queries into an LLM’s preferred language is a rather clumsy workaround. After all, language is a vehicle for worldviews and culture as well as meaning, notes the boss of one Indian AI firm. A paper by Rebecca Johnson, a researcher at the University of Sydney, published in 2022, found that GPT-3 gave replies on topics such as gun control and refugee policy that aligned most with the values displayed by Americans in the World Values Survey, a global questionnaire of public opinion.
Many researchers are therefore trying to make LLMs themselves more fluent in less widely spoken languages. One approach is to modify the tokeniser, the part of an LLM that chops words into smaller chunks for the rest of the model to manipulate. Text in Devanagari, a script used with Hindi, needs three to four times more tokens, when tokenised the standard way, than the same text in English. An Indian startup called Sarvam AI has written a tokeniser optimised for Hindi, which cuts that number substantially. Fewer tokens means fewer computations. Sarvam reckons that OpenHathi, its Devanagari-optimised LLM, can cut the cost of answering questions by around three-quarters.
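A rough way to see why script matters is to look at text as bytes, which is what a tokeniser with little Hindi in its vocabulary tends to fall back towards. The Python sketch below counts UTF-8 bytes, not real tokens, and then works through the cost arithmetic on the assumption that cost scales linearly with token count. The function names are illustrative, and the fourfold reduction is simply the upper end of the range quoted above, not a figure reported by Sarvam.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in the text."""
    return len(text.encode("utf-8")) / len(text)


def cost_saving(token_reduction_factor: float) -> float:
    """Fraction of cost saved if token count falls by this factor.

    Assumes cost is proportional to the number of tokens (an assumption).
    """
    return 1 - 1 / token_reduction_factor


english = "hello"
hindi = "नमस्ते"  # a Hindi greeting, written in Devanagari

print(utf8_bytes_per_char(english))  # 1.0
print(utf8_bytes_per_char(hindi))    # 3.0
print(cost_saving(4))                # 0.75
```

Under that linear assumption, a tokeniser that needs a quarter as many tokens for the same text would cut the cost by three-quarters.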
Another approach is to improve the datasets on which LLMs are trained. Often this means digitising reams of pen-and-paper texts. In November a team of researchers at Mohamed bin Zayed University, in Abu Dhabi, released the latest version of an Arabic-speaking model called “Jais”. It has one-sixth as many parameters (one measure of a model’s size) as GPT-3, but performs on par with it in Arabic. Timothy Baldwin, the university’s acting provost, notes that, because his team could only digitise so much Arabic text, the model also included some English. Some concepts, after all, are similar across all languages, and can be learned in any tongue. Data in a specific language are more important for teaching the model specific cultural ideas and quirks.
The third approach is to tweak models after they have been trained. Both Jais and OpenHathi have been tuned with question-and-answer pairs hand-crafted by humans. The same happens with Western chatbots, to stop them spreading what their makers see as disinformation. Ernie Bot, an LLM from Baidu, a big Chinese tech company, has been tweaked to try to stop it saying things to which the government might object. Models can also learn from human feedback, in which users rate an LLM’s answers. But that is hard to do for many poor-world languages, says Dr Baldwin, since it requires recruiting people literate enough to criticise the machine’s writing.
How well all this will work remains to be seen. A quarter of India’s adults are illiterate, something that no amount of LLM tweaking will solve. Many Indians prefer using voice messages to communicate rather than text ones. AI can also turn speech into words, as India’s chatbot for farmers does. But that adds another step at which errors can creep in.
And it is possible that builders of local LLMs may eventually be put out of business by the efforts of the Silicon Valley big boys. Although it is far from perfect, GPT-4 is much better than GPT-3 at answering questions in non-English languages. However it is done, teaching AI to speak more of the world’s 7,000-odd languages can only be a good thing.
© 2024, The Economist Newspaper Limited. All rights reserved. From The Economist, published under licence. The original content can be found on www.economist.com
Published: 26 Mar 2024, 03:12 PM IST