

Why AI needs to learn new languages
OpenAI has not revealed much about how GPT-4 was built. But a look at its predecessor, GPT-3, is suggestive. Large language models (LLMs) are trained on text scraped from the internet, on which English is the lingua franca. Around 93% of GPT-3’s training data was in English. In Common Crawl, just one of the datasets on which the model was trained, English makes up 47% of the corpus, with other (mostly related) European languages accounting for 38% more. Chinese and Japanese combined, by contrast, make up just 9%. Telugu was not even a rounding error.
An evaluation by Nathaniel Robinson, a researcher at Johns Hopkins University, and his colleagues finds that this is not a problem limited to ChatGPT. All LLMs fare better with “high-resource” languages, for which training data are plentiful, than with “low-resource” ones, for which they are scarce. That is a problem for those hoping to export AI to poor countries, where it might improve everything from schools to health care. Researchers around the world are therefore working to make AI more multilingual.
India’s government is particularly keen. Many of its public services are already digitised, and it wants to fortify them with AI. In September, for instance, it launched a chatbot to help farmers get information about state benefits.
The bot works by welding two sorts of language model together, says Shankar Maruwada of the EkStep Foundation, a non-profit that helped build it. Users can submit queries in their native tongues. (Eight are supported so far; five more are coming soon.) These are passed to a piece of machine-translation software developed at IIT Madras, an Indian academic institution, which translates them into English. The English version of the question is then fed to the LLM, and its response translated back into the user’s mother tongue.
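The flow described above can be sketched in a few lines of Python. This is only an illustration of the translate, query, translate-back pattern, not the code behind the farmers' chatbot: the function names, the `Translator` and `LanguageModel` interfaces, the language code and the toy stand-ins are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical interfaces: the article does not describe the real APIs.
Translator = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text
LanguageModel = Callable[[str], str]         # English prompt -> English reply


def answer_in_user_language(
    query: str,
    user_lang: str,
    translate: Translator,
    llm: LanguageModel,
    pivot_lang: str = "en",
) -> str:
    """Translate the query into English, ask the LLM, translate the reply back."""
    english_query = translate(query, user_lang, pivot_lang)
    english_reply = llm(english_query)
    return translate(english_reply, pivot_lang, user_lang)


# Toy stand-ins that only label their input, so the data flow is visible.
def toy_translate(text: str, source_lang: str, target_lang: str) -> str:
    return f"[{source_lang}->{target_lang}] {text}"


def toy_llm(prompt: str) -> str:
    return f"reply to: {prompt}"


if __name__ == "__main__":
    # "te" is used purely as an illustrative language code.
    print(answer_in_user_language("query", "te", toy_translate, toy_llm))
```

Because the model only ever sees English, any nuance lost in the first translation step is invisible to it, and any error in the second is invisible to the user.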
The system seems to work. But translating queries into an LLM’s preferred language is a rather clumsy workaround. After all, language is a vehicle for worldviews and culture as well as meaning, notes the boss of one Indian AI firm. A paper by Rebecca Johnson, a researcher at the University of Sydney, published in 2022, found that GPT-3 gave replies on topics such as gun control and refugee policy that aligned most closely with the values displayed by Americans in the World Values Survey, a global questionnaire of public opinion.
Many researchers are therefore trying to make LLMs themselves more fluent in less widely spoken languages. One approach is to modify the tokeniser, the part of an LLM that chops words into smaller chunks for the rest of the model to manipulate. Text in Devanagari, the script used to write Hindi, needs three to four times more tokens, when tokenised the standard way, than the same text in English. An Indian startup called Sarvam AI has written a tokeniser optimised for Hindi, which cuts that number substantially. Fewer tokens mean fewer computations. Sarvam reckons that OpenHathi, its Devanagari-optimised LLM, can cut the cost of answering questions by around three-quarters.
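A rough way to see why Devanagari text can be token-hungry is to look at its raw encoding. The article does not say how any particular tokeniser works, but many start from UTF-8 bytes, and the sketch below (an illustration under that assumption, with hypothetical helper names) shows that each Devanagari code point takes three bytes where a Latin letter takes one. The `cost_saving` helper further assumes that cost is simply proportional to token count, which is a simplification.

```python
def utf8_byte_count(text: str) -> int:
    """Number of bytes needed to store the text in UTF-8."""
    return len(text.encode("utf-8"))


def bytes_per_character(text: str) -> float:
    """Average UTF-8 bytes per Unicode code point."""
    return utf8_byte_count(text) / len(text)


def cost_saving(tokens_before: float, tokens_after: float) -> float:
    """Fraction of cost saved, ASSUMING cost is proportional to token count."""
    return 1 - tokens_after / tokens_before


# "namaste" written in Devanagari, spelled out as six code points.
NAMASTE = "\u0928\u092e\u0938\u094d\u0924\u0947"

if __name__ == "__main__":
    print(bytes_per_character("hello"))  # Latin letters: 1 byte each
    print(bytes_per_character(NAMASTE))  # Devanagari code points: 3 bytes each
    print(cost_saving(4, 1))             # four times the tokens, brought to parity
```

Under these assumptions, a text that needs four times as many tokens as its English equivalent would cost three-quarters less once the tokeniser brings it to parity, which is in line with the saving Sarvam reckons on.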
Another approach is to improve the datasets on which LLMs are trained. Often this means digitising reams of pen-and-paper texts. In November a team of researchers at Mohamed bin Zayed University, in Abu Dhabi, released the latest version of an Arabic-speaking model called “Jais”. It has one-sixth as many parameters (one measure of a model’s size) as GPT-3, but performs on par with it in Arabic. Timothy Baldwin, the university’s acting provost, notes that, because his team could only digitise so much Arabic text, the model’s training data also included some English. Some concepts, after all, are similar across all languages, and can be learned in any tongue. Data in a specific language are more important for teaching the model specific cultural ideas and quirks.
The third approach is to tweak models after they have been trained. Both Jais and OpenHathi have been fine-tuned on question-and-answer pairs hand-crafted by humans. The same happens with Western chatbots, to stop them spreading what their makers see as disinformation. Ernie Bot, an LLM from Baidu, a big Chinese tech company, has been tweaked to try to stop it saying things to which the government might object. Models can also learn from human feedback, in which users rate an LLM’s answers. But that is hard to do for many poor-world languages, says Dr Baldwin, since it requires recruiting people literate enough to criticise the machine’s writing.
How well all this will work remains to be seen. A quarter of India’s adults are illiterate, something that no amount of LLM tweaking will solve. Many Indians prefer using voice messages to communicate rather than text ones. AI can also turn speech into words, as India’s chatbot for farmers does. But that adds another step at which errors can creep in.
And it is possible that builders of local LLMs may eventually be put out of business by the efforts of the Silicon Valley big boys. Although it is far from perfect, GPT-4 is much better than GPT-3 at answering questions in non-English languages. However it is done, teaching AI to speak more of the world’s 7,000-odd languages can only be a good thing.
© 2024, The Economist Newspaper Limited. All rights reserved. From The Economist, published under licence. The original content can be found on www.economist.com
Published: 26 Mar 2024, 03:12 PM IST