

Why AI needs to learn new languages
OpenAI has not revealed much about how ChatGPT-4 was built. But a look at its predecessor, ChatGPT-3, is suggestive. Large language models (LLMs) are trained on text scraped from the internet, on which English is the lingua franca. Around 93% of ChatGPT-3’s training data was in English. In Common Crawl, just one of the datasets on which the model was trained, English makes up 47% of the corpus, with other (mostly related) European languages accounting for 38% more. Chinese and Japanese combined, by contrast, made up just 9%. Telugu was not even a rounding error.
An evaluation by Nathaniel Robinson, a researcher at Johns Hopkins University, and his colleagues finds that this is not a problem limited to ChatGPT. All LLMs fare better with “high-resource” languages, for which training data are plentiful, than with “low-resource” ones, for which they are scarce. That is a problem for those hoping to export AI to poor countries to improve everything from schools to health care. Researchers around the world are therefore working to make AI more multilingual.
India’s government is particularly keen. Many of its public services are already digitised, and it wants to fortify them with AI. In September, for instance, it launched a chatbot to help farmers get information about state benefits.
The bot works by welding two sorts of language model together, says Shankar Maruwada of the EkStep Foundation, a non-profit that helped build it. Users can submit queries in their native tongues. (Eight are supported so far; five more are coming soon.) These are passed to a piece of machine-translation software developed at IIT Madras, an Indian academic institution, which translates them into English. The English version of the question is then fed to the LLM, and its response translated back into the user’s mother tongue.
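The flow described above (native-language query, translation into English, the LLM's reply, translation back) can be sketched in a few lines. This is only an illustration of the pipeline's shape, not the chatbot's actual code: the names `answer_query`, `fake_translate` and `fake_llm` are hypothetical, and the two stand-in functions merely tag the text so that the order of the steps is visible.

```python
from typing import Callable

# Hypothetical signature: (text, source_lang, target_lang) -> translated text
Translator = Callable[[str, str, str], str]
# Hypothetical signature: English prompt -> English reply
LanguageModel = Callable[[str], str]


def answer_query(query: str, user_lang: str,
                 translate: Translator, llm: LanguageModel,
                 pivot_lang: str = "en") -> str:
    """Translate the query into the pivot language, ask the LLM, translate back."""
    pivot_query = translate(query, user_lang, pivot_lang)
    pivot_answer = llm(pivot_query)
    return translate(pivot_answer, pivot_lang, user_lang)


# Stand-ins so the sketch runs; a real system would call actual models here.
def fake_translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"


def fake_llm(prompt: str) -> str:
    return f"answer to: {prompt}"


if __name__ == "__main__":
    # "te" is the ISO 639-1 code for Telugu, used here purely as an example.
    print(answer_query("question", "te", fake_translate, fake_llm))
```

Every query passes through the translator twice, so any mistranslation on the way in or out ends up in the final answer.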
The system seems to work. But translating queries into an LLM’s preferred language is a rather clumsy workaround. After all, language is a vehicle for worldviews and culture as well as just meaning, notes the boss of one Indian AI firm. A paper by Rebecca Johnson, a researcher at the University of Sydney, published in 2022, found that ChatGPT-3 gave replies on topics such as gun control and refugee policy that aligned most with the values displayed by Americans in the World Values Survey, a global questionnaire of public opinion.
Many researchers are therefore trying to make LLMs themselves more fluent in less widely spoken languages. One approach is to modify the tokeniser, the part of an LLM that chops words into smaller chunks for the rest of the model to manipulate. Text in Devanagari, a script used with Hindi, needs three to four times more tokens, when tokenised the standard way, than the same text in English. An Indian startup called Sarvam AI has written a tokeniser optimised for Hindi, which cuts that number substantially. Fewer tokens means fewer computations. Sarvam reckons that OpenHathi, its Devanagari-optimised LLM, can cut the cost of answering questions by around three-quarters.
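One reason Devanagari text is token-hungry can be seen without any real tokeniser. Many tokenisers start from UTF-8 bytes, and each Devanagari code point takes three bytes where an ASCII letter takes one. The sketch below counts bytes only; it is a rough proxy, not a measurement of any particular model's tokeniser. The helper names are hypothetical, and the cost function assumes, as a simplification, that cost scales linearly with token count. The token counts passed to it are illustrative, not Sarvam's figures.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in the text."""
    return len(text.encode("utf-8")) / len(text)


def relative_cost_saving(tokens_before: int, tokens_after: int) -> float:
    """Fraction of cost saved, ASSUMING cost is proportional to token count."""
    return 1 - tokens_after / tokens_before


english = "water"
hindi = "पानी"  # "water" in Hindi, written in Devanagari (4 code points)

if __name__ == "__main__":
    print(utf8_bytes_per_char(english))  # ASCII letters: 1 byte each
    print(utf8_bytes_per_char(hindi))    # Devanagari code points: 3 bytes each
    # Illustrative numbers only: cutting 4 tokens to 1 saves three-quarters.
    print(relative_cost_saving(4, 1))
```

A tokeniser trained on plenty of Hindi text can merge those byte sequences into whole syllables or words, which is how a language-specific vocabulary brings the count down.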
Another is to improve the datasets on which LLMs are trained. Often this means digitising reams of pen-and-paper texts. In November a team of researchers at Mohamed bin Zayed University, in Abu Dhabi, released the latest version of an Arabic-speaking model called “Jais”. It has one-sixth as many parameters (one measure of a model’s size) as ChatGPT-3, but performs on par with it in Arabic. Timothy Baldwin, the university’s acting provost, notes that, because his team could only digitise so much Arabic text, the model also included some English. Some concepts, after all, are similar across all languages, and can be learned in any tongue. Data in a specific language are more important for teaching the model specific cultural ideas and quirks.
The third approach is to tweak models after they have been trained. Both Jais and OpenHathi have had some question-and-answer pairs hand-crafted by humans. The same happens with Western chatbots, to stop them spreading what their makers see as disinformation. Ernie Bot, an LLM from Baidu, a big Chinese tech company, has been tweaked to try to stop it saying things to which the government might object. Models can also learn from human feedback, in which users rate an LLM’s answers. But that is hard to do for many poor-world languages, says Dr Baldwin, since it requires recruiting people literate enough to criticise the machine’s writing.
How well all this will work remains to be seen. A quarter of India’s adults are illiterate, something that no amount of LLM tweaking will solve. Many Indians prefer using voice messages to communicate rather than text ones. AI can also turn speech into words, as India’s chatbot for farmers does. But that adds another step at which errors can creep in.
And it is possible that builders of local LLMs may eventually be put out of business by the efforts of the Silicon Valley big boys. Although it is far from perfect, ChatGPT-4 is much better than ChatGPT-3 at answering questions in non-English languages. However it is done, teaching AI to speak more of the world’s 7,000-odd languages can only be a good thing.
© 2024, The Economist Newspaper Limited. All rights reserved. From The Economist, published under licence. The original content can be found on www.economist.com
Published: 26 Mar 2024, 03:12 PM IST