

Why AI needs to learn new languages
OpenAI has not revealed much about how GPT-4 was built. But a look at its predecessor, GPT-3, is suggestive. Large language models (LLMs) are trained on text scraped from the internet, where English is the lingua franca. Around 93% of GPT-3’s training data was in English. In Common Crawl, just one of the datasets on which the model was trained, English makes up 47% of the corpus, with other (mostly related) European languages accounting for a further 38%. Chinese and Japanese combined, by contrast, make up just 9%. Telugu was not even a rounding error.
An evaluation by Nathaniel Robinson, a researcher at Johns Hopkins University, and his colleagues finds that this is not a problem limited to ChatGPT. All LLMs fare better with “high-resource” languages, for which training data are plentiful, than with “low-resource” ones, for which they are scarce. That is a problem for those hoping to export AI to poor countries, where it might improve everything from schools to health care. Researchers around the world are therefore working to make AI more multilingual.
India’s government is particularly keen. Many of its public services are already digitised, and it wants to fortify them with AI. In September, for instance, it launched a chatbot to help farmers get information about state benefits.
The bot works by welding two sorts of language model together, says Shankar Maruwada of the EkStep Foundation, a non-profit that helped build it. Users can submit queries in their native tongues. (Eight are supported so far; five more are coming soon.) These are passed to a piece of machine-translation software developed at IIT Madras, an Indian academic institution, which translates them into English. The English version of the question is then fed to the LLM, and its response translated back into the user’s mother tongue.
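The translate-then-ask flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the EkStep or IIT Madras implementation: the function names, the `(text, source, target)` translator signature and the demo stand-ins are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical interfaces: the article does not describe the real APIs.
Translator = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text
LanguageModel = Callable[[str], str]         # English prompt -> English answer


def answer_in_native_language(
    query: str,
    user_lang: str,
    translate: Translator,
    llm: LanguageModel,
    pivot_lang: str = "en",
) -> str:
    """Route a native-language query through an English-centric LLM."""
    # Step 1: machine-translate the user's query into English.
    english_query = translate(query, user_lang, pivot_lang)
    # Step 2: let the LLM answer in the language it handles best.
    english_answer = llm(english_query)
    # Step 3: translate the answer back into the user's language.
    return translate(english_answer, pivot_lang, user_lang)


# Stand-ins that only label what happened, so the flow is visible.
def demo_translate(text: str, source_lang: str, target_lang: str) -> str:
    return f"[{source_lang}->{target_lang}] {text}"


def demo_llm(prompt: str) -> str:
    return f"answer to: {prompt}"


if __name__ == "__main__":
    print(answer_in_native_language("query", "te", demo_translate, demo_llm))
```

Every answer passes through the translator twice, so any translation error on the way in also shapes what the LLM is actually asked.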
The system seems to work. But translating queries into an LLM’s preferred language is a rather clumsy workaround. After all, language is a vehicle for worldviews and culture as well as meaning, notes the boss of one Indian AI firm. A paper by Rebecca Johnson, a researcher at the University of Sydney, published in 2022, found that GPT-3 gave replies on topics such as gun control and refugee policy that aligned most with the values displayed by Americans in the World Values Survey, a global questionnaire of public opinion.
Many researchers are therefore trying to make LLMs themselves more fluent in less widely spoken languages. One approach is to modify the tokeniser, the part of an LLM that chops words into smaller chunks for the rest of the model to manipulate. Text in Devanagari, a script used with Hindi, needs three to four times more tokens, when tokenised the standard way, than the same text in English. An Indian startup called Sarvam AI has written a tokeniser optimised for Hindi, which cuts that number substantially. Fewer tokens means fewer computations. Sarvam reckons that OpenHathi, its Devanagari-optimised LLM, can cut the cost of answering questions by around three-quarters.
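One reason Devanagari text can be token-hungry is visible at the byte level. UTF-8 stores each Devanagari code point in three bytes, against one byte for an ASCII letter, so a tokeniser with few learned merges for Hindi starts from a longer sequence. The sketch below uses one token per byte as a deliberately crude worst case; real tokenisers merge bytes into larger units, and the article's three-to-four-times figure refers to those. The saving calculation assumes cost scales linearly with token count, which is an assumption of this example and not something Sarvam AI is reported to have said.

```python
def byte_fallback_token_count(text: str) -> int:
    """Worst case for a byte-level tokeniser with no merges: one token per UTF-8 byte."""
    return len(text.encode("utf-8"))


def cost_saving(token_reduction_factor: float) -> float:
    """Fraction of cost saved if cost is proportional to token count (an assumption)."""
    if token_reduction_factor <= 0:
        raise ValueError("token_reduction_factor must be positive")
    return 1.0 - 1.0 / token_reduction_factor


english_word = "water"
hindi_word = "पानी"  # "water" in Hindi: four Devanagari code points

if __name__ == "__main__":
    print(byte_fallback_token_count(english_word))  # 5 bytes
    print(byte_fallback_token_count(hindi_word))    # 12 bytes
    print(cost_saving(4.0))                         # 0.75
```

Under that linear assumption, cutting the token count by a factor of four saves 1 - 1/4 = 75% of the cost, which is in line with the three-quarters figure quoted for OpenHathi.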
Another approach is to improve the datasets on which LLMs are trained. Often this means digitising reams of pen-and-paper texts. In November a team of researchers at Mohamed bin Zayed University, in Abu Dhabi, released the latest version of an Arabic-speaking model called “Jais”. It has one-sixth as many parameters (one measure of a model’s size) as GPT-3, but performs on par with it in Arabic. Timothy Baldwin, the university’s acting provost, notes that, because his team could only digitise so much Arabic text, the model’s training data also included some English. Some concepts, after all, are similar across all languages, and can be learned in any tongue. Data in a specific language are more important for teaching the model specific cultural ideas and quirks.
A third approach is to tweak models after they have been trained. Both Jais and OpenHathi have been fine-tuned on question-and-answer pairs hand-crafted by humans. The same happens with Western chatbots, to stop them spreading what their makers see as disinformation. Ernie Bot, an LLM from Baidu, a big Chinese tech company, has been tweaked to try to stop it saying things to which the government might object. Models can also learn from human feedback, in which users rate an LLM’s answers. But that is hard to do for many poor-world languages, says Dr Baldwin, since it requires recruiting people literate enough to criticise the machine’s writing.
How well all this will work remains to be seen. A quarter of India’s adults are illiterate, something that no amount of LLM tweaking will solve. Many Indians prefer using voice messages to communicate rather than text ones. AI can also turn speech into words, as India’s chatbot for farmers does. But that adds another step at which errors can creep in.
And it is possible that builders of local LLMs may eventually be put out of business by the efforts of the Silicon Valley big boys. Although it is far from perfect, GPT-4 is much better than GPT-3 at answering questions in non-English languages. However it is done, teaching AI to speak more of the world’s 7,000-odd languages can only be a good thing.
© 2024, The Economist Newspaper Limited. All rights reserved. From The Economist, published under licence. The original content can be found on www.economist.com
Published: 26 Mar 2024, 03:12 PM IST