Why AI needs to learn new languages
OpenAI has not revealed much about how ChatGPT-4 was built. But a look at its predecessor, ChatGPT-3, is suggestive. Large language models (LLMs) are trained on text scraped from the internet, on which English is the lingua franca. Around 93% of ChatGPT-3’s training data was in English. In Common Crawl, just one of the datasets on which the model was trained, English makes up 47% of the corpus, with other (mostly related) European languages accounting for 38% more. Chinese and Japanese combined, by contrast, make up just 9%. Telugu was not even a rounding error.
An evaluation by Nathaniel Robinson, a researcher at Johns Hopkins University, and his colleagues finds that this problem is not limited to ChatGPT. All LLMs fare better with “high-resource” languages, for which training data are plentiful, than with “low-resource” ones, for which they are scarce. That is a problem for those who want to export AI to poor countries, in the hope it might improve everything from schools to health care. Researchers around the world are therefore working to make AI more multilingual.
India’s government is particularly keen. Many of its public services are already digitised, and it wants to fortify them with AI. In September, for instance, it launched a chatbot to help farmers get information about state benefits.
The bot works by welding two sorts of language model together, says Shankar Maruwada of the EkStep Foundation, a non-profit that helped build it. Users can submit queries in their native tongues. (Eight are supported so far; five more are coming soon.) These are passed to a piece of machine-translation software developed at IIT Madras, an Indian academic institution, which translates them into English. The English version of the question is then fed to the LLM, and its response translated back into the user’s mother tongue.
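The translate-ask-translate flow described above can be sketched roughly as follows. This is a minimal illustration, not the chatbot's actual code: the function names, the callable signatures and the language codes are all hypothetical, and the translation software and the LLM are replaced by stand-ins so that the sketch runs on its own.

```python
from typing import Callable

# Hypothetical signatures, chosen for illustration only.
Translate = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text
AskLLM = Callable[[str], str]               # English question -> English answer


def answer_in_user_language(query: str, user_lang: str,
                            translate: Translate, ask_llm: AskLLM) -> str:
    """Translate the query into English, ask the LLM, translate the reply back."""
    english_query = translate(query, user_lang, "en")
    english_answer = ask_llm(english_query)
    return translate(english_answer, "en", user_lang)


# Stand-in components so the sketch runs without any real services.
def fake_translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"


def fake_llm(question: str) -> str:
    return f"answer to: {question}"


print(answer_in_user_language("question", "te", fake_translate, fake_llm))
```

Because every query passes through translation twice, any error made in either direction is carried into the final reply.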
The system seems to work. But translating queries into an LLM’s preferred language is a rather clumsy workaround. After all, language is a vehicle for worldviews and culture as well as just meaning, notes the boss of one Indian AI firm. A paper by Rebecca Johnson, a researcher at the University of Sydney, published in 2022, found that ChatGPT-3 gave replies on topics such as gun control and refugee policy that aligned most with the values displayed by Americans in the World Values Survey, a global questionnaire of public opinion.
Many researchers are therefore trying to make LLMs themselves more fluent in less widely spoken languages. One approach is to modify the tokeniser, the part of an LLM that chops words into smaller chunks for the rest of the model to manipulate. Text in Devanagari, a script used with Hindi, needs three to four times more tokens, when tokenised the standard way, than the same text in English. An Indian startup called Sarvam AI has written a tokeniser optimised for Hindi, which cuts that number substantially. Fewer tokens means fewer computations. Sarvam reckons that OpenHathi, its Devanagari-optimised LLM, can cut the cost of answering questions by around three-quarters.
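A rough sketch of the arithmetic behind these figures follows. It rests on two assumptions that the article does not state: that the tokeniser falls back on UTF-8 bytes for scripts it has seen rarely, which is one plausible reason Devanagari text needs more tokens, and that the cost of an answer grows in proportion to the number of tokens. The helper names are hypothetical.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    return len(text.encode("utf-8")) / len(text)


def cost_saving(token_reduction_factor: float) -> float:
    """Fraction of cost saved if cost is proportional to token count
    (an assumption) and the token count shrinks by the given factor."""
    return 1.0 - 1.0 / token_reduction_factor


english = "hello"
hindi = "नमस्ते"  # Devanagari characters take three bytes each in UTF-8

print(utf8_bytes_per_char(english))  # 1.0
print(utf8_bytes_per_char(hindi))    # 3.0
print(cost_saving(4.0))              # 0.75
```

Under these assumptions, needing a quarter as many tokens would correspond to the three-quarters saving that Sarvam cites.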
Another is to improve the datasets on which LLMs are trained. Often this means digitising reams of pen-and-paper texts. In November a team of researchers at Mohamed bin Zayed University, in Abu Dhabi, released the latest version of an Arabic-speaking model called “Jais”. It has one-sixth as many parameters (one measure of a model’s size) as ChatGPT-3, but performs on par with it in Arabic. Timothy Baldwin, the university’s acting provost, notes that, because his team could only digitise so much Arabic text, the model also included some English. Some concepts, after all, are similar across all languages, and can be learned in any tongue. Data in a specific language are more important for teaching the model specific cultural ideas and quirks.
The third approach is to tweak models after they have been trained. Both Jais and OpenHathi have had some question-and-answer pairs hand crafted by humans. The same happens with Western chatbots, to stop them spreading what their makers see as disinformation. Ernie Bot, an LLM from Baidu, a big Chinese tech company, has been tweaked to try to stop it saying things to which the government might object. Models can also learn from human feedback, in which users rate an LLM’s answers. But that is hard to do for many poor-world languages, says Dr Baldwin, since it requires recruiting people literate enough to criticise the machine’s writing.
How well all this will work remains to be seen. A quarter of India’s adults are illiterate, something that no amount of LLM tweaking will solve. Many Indians prefer using voice messages to communicate rather than text ones. AI can also turn speech into words, as India’s chatbot for farmers does. But that adds another step at which errors can creep in.
And it is possible that builders of local LLMs may eventually be put out of business by the efforts of the Silicon Valley big boys. Although it is far from perfect, ChatGPT-4 is much better than ChatGPT-3 at answering questions in non-English languages. However it is done, teaching AI to speak more of the world’s 7,000-odd languages can only be a good thing.
© 2024, The Economist Newspaper Limited. All rights reserved. From The Economist, published under licence. The original content can be found on www.economist.com
Published: 26 Mar 2024, 03:12 PM IST
