{"id":430257,"date":"2026-06-11T09:00:09","date_gmt":"2026-06-11T03:30:09","guid":{"rendered":"https:\/\/dripp.zone\/news\/googles-diffusiongemma-ai-hits-1000-tokens-per-second-and-its-free-crypto-news-2\/"},"modified":"2026-06-11T09:22:44","modified_gmt":"2026-06-11T03:52:44","slug":"googles-diffusiongemma-ai-hits-1000-tokens-per-second-and-its-free-crypto-news-2","status":"publish","type":"post","link":"https:\/\/dripp.zone\/news\/googles-diffusiongemma-ai-hits-1000-tokens-per-second-and-its-free-crypto-news-2\/","title":{"rendered":"Google&#8217;s DiffusionGemma AI Hits 1,000 Tokens Per Second\u2014And It&#8217;s Free &#8211; Crypto News"},"content":{"rendered":"<p><\/p>\n<div style=\"position:relative;overflow:visible;font-size:1.2em;line-height:1.58\">\n<div class=\"pt-8 pb-10 border-t border-b border-decryptGridline \">\n<h4 class=\"sc-b2a202e4-4 bNRGqr gg-dark:text-white\" color=\"#333\">In brief<\/h4>\n<ul>\n<li class=\"font-meta-serif-pro font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">Google released DiffusionGemma, a free open-weight model that generates entire 256-token blocks simultaneously via text diffusion\u2014hitting over 1,000 tokens per second on an NVIDIA H100, four times faster than standard autoregressive models.<\/li>\n<li class=\"font-meta-serif-pro font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">The custom drafter module DiffusionGemma needs for local inference doesn&#8217;t exist in any public runtime yet\u2014not in mlx-lm, not in LM Studio\u2014making it effectively unrunnable on most consumer setups today.<\/li>\n<li class=\"font-meta-serif-pro font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">On NVIDIA NIM, the model arrived preconfigured at 8,192 tokens of context\u2014below the 64,000-token floor that agentic frameworks like Hermes Agent require\u2014meaning autonomous workflows won&#8217;t run without 
manual reconfiguration.<\/li>\n<\/ul>\n<\/div>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">Google dropped DiffusionGemma <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/huggingface.co\/google\/diffusiongemma-26B-A4B-it\" target=\"_blank\" rel=\"nofollow external noopener\" class=\"sc-adb616fe-0 bJsyml\">today<\/a>, an open-weight AI model that generates text the way image generators create pictures: start with noise, refine until it makes sense. It hits 1,000 tokens per second on an NVIDIA H100. (Tokens are the basic units of text that an AI model processes.) That makes it four times faster than regular Gemma. It\u2019s also free: released under Apache 2.0, with weights on Hugging Face.<\/p>\n<div>\n<figure class=\"w-full max-w-full mt-4 overflow-hidden\"><\/figure>\n<\/div>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">The catch, as always, is in the fine print. Per <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/blog.google\/innovation-and-ai\/technology\/developers-tools\/diffusion-gemma-faster-text-generation\/\" target=\"_blank\" class=\"sc-adb616fe-0 bJsyml\">Google&#8217;s announcement<\/a>, the model hits &#8220;700+ tokens per second on NVIDIA GeForce RTX 5090.&#8221; It also trails standard Gemma 4 on output quality.<\/p>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">Google says so itself. 
This is a speed model, not a quality upgrade.<\/p>\n<h2 class=\"sc-b2a202e4-2 bmropA gg-dark:text-white scene:font-itc-avant-garde-gothic-pro scene:font-light\" style=\"margin-top:2em;text-align:left;padding-bottom:16px;margin-bottom:16px;border-bottom:1px solid #dfe2e4\" color=\"#333\">What this actually does<\/h2>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">Every LLM you&#8217;ve used is a typewriter. It produces one token at a time, with each token dependent on the ones before it. That&#8217;s how autoregressive architectures work.<\/p>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">DiffusionGemma doesn&#8217;t do that. Instead of generating tokens sequentially, it starts with chunks of garbled text and refines them in parallel. Per Google&#8217;s <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/developers.googleblog.com\/en\/diffusiongemma-the-developer-guide\/\" target=\"_blank\" rel=\"nofollow external noopener\" class=\"sc-adb616fe-0 bJsyml\">developer guide<\/a>, it &#8220;starts with a canvas of random placeholder tokens&#8221; and iteratively locks in confident tokens until the whole block snaps into focus. Two hundred fifty-six tokens per forward pass. 
The GPU stays busy.<\/p>\n<div>\n<figure class=\"w-full max-w-full mt-4 overflow-hidden\"><img alt=\"\" loading=\"lazy\" width=\"1476\" height=\"850\" decoding=\"async\" data-nimg=\"1\" class=\"object-contain object-center w-full\" style=\"color:transparent\" sizes=\"auto, (min-width: 640px) 950px, 384px\" srcset=\"https:\/\/img.decrypt.co\/insecure\/rs:fit:16:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 16w, https:\/\/img.decrypt.co\/insecure\/rs:fit:32:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 32w, https:\/\/img.decrypt.co\/insecure\/rs:fit:48:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 48w, https:\/\/img.decrypt.co\/insecure\/rs:fit:64:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 64w, https:\/\/img.decrypt.co\/insecure\/rs:fit:96:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 96w, https:\/\/img.decrypt.co\/insecure\/rs:fit:128:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 128w, https:\/\/img.decrypt.co\/insecure\/rs:fit:256:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 256w, https:\/\/img.decrypt.co\/insecure\/rs:fit:384:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 384w, https:\/\/img.decrypt.co\/insecure\/rs:fit:640:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 640w, 
https:\/\/img.decrypt.co\/insecure\/rs:fit:750:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 750w, https:\/\/img.decrypt.co\/insecure\/rs:fit:828:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 828w, https:\/\/img.decrypt.co\/insecure\/rs:fit:1080:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 1080w, https:\/\/img.decrypt.co\/insecure\/rs:fit:1200:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 1200w, https:\/\/img.decrypt.co\/insecure\/rs:fit:1920:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 1920w, https:\/\/img.decrypt.co\/insecure\/rs:fit:2048:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 2048w, https:\/\/img.decrypt.co\/insecure\/rs:fit:3840:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp 3840w\" src=\"https:\/\/img.decrypt.co\/insecure\/rs:fit:3840:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/06\/Captura-de-pantalla-2026-06-10-a-las-18.13.31.png@webp\"\/><\/figure>\n<\/div>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">The side effect is bidirectional attention: every token can see every other token in the block while it is being generated, which is impossible in autoregressive models (they cannot attend to tokens that have not been generated yet). 
That makes it unusually good at tasks where the end of the answer constrains the beginning: code infilling, structured output, and constraint-heavy problems. Google fine-tuned a version to solve Sudoku as a demo. The base model got roughly 0% of puzzles right.<\/p>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">The fine-tuned version hit 80%.<\/p>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">Text diffusion has been a research project for years. <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/kuleshov-group\/mdlm\" target=\"_blank\" class=\"sc-adb616fe-0 bJsyml\">MDLM<\/a>, <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.oxen.ai\/models\/SEDD-large\" target=\"_blank\" rel=\"nofollow external noopener\" class=\"sc-adb616fe-0 bJsyml\">SEDD<\/a>, <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2502.09992\" target=\"_blank\" class=\"sc-adb616fe-0 bJsyml\">LLaDA<\/a>, and <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/huggingface.co\/collections\/Dream-org\/dream-7b\" target=\"_blank\" rel=\"nofollow external noopener\" class=\"sc-adb616fe-0 bJsyml\">Dream<\/a> were academic models that proved the approach worked at small scales and mostly remained proofs of concept. 
Inception Labs shipped <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.businesswire.com\/news\/home\/20260224034496\/en\/Inception-Launches-Mercury-2-the-Fastest-Reasoning-LLM-5x-Faster-Than-Leading-Speed-Optimized-LLMs-with-Dramatically-Lower-Inference-Cost\" target=\"_blank\" rel=\"nofollow external noopener\" class=\"sc-adb616fe-0 bJsyml\">Mercury 2<\/a> in February 2026 as the first commercial diffusion reasoning model, claiming speeds five times faster than speed-optimized competitors.<\/p>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">But none of that was open-weight, and none of it came with day-zero support in vLLM, Hugging Face Transformers, and Unsloth. DiffusionGemma is the first major open release from a tier-one lab.<\/p>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">There&#8217;s also a historical irony worth noting. Image generators started as diffusion models (hence the name Stable Diffusion) and are now moving toward autoregressive architectures for better quality. 
Language models started as autoregressive and are now experimenting with diffusion for speed.<\/p>\n<h2 class=\"sc-b2a202e4-2 bmropA gg-dark:text-white scene:font-itc-avant-garde-gothic-pro scene:font-light\" style=\"margin-top:2em;text-align:left;padding-bottom:16px;margin-bottom:16px;border-bottom:1px solid #dfe2e4\" color=\"#333\">Why it\u2019s a pain to run\u2026 for now<\/h2>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">Running DiffusionGemma efficiently requires a drafter\u2014a lightweight module that proposes token blocks in parallel, which the main model then verifies in one forward pass. This is called speculative decoding. <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2602.06036\" target=\"_blank\" class=\"sc-adb616fe-0 bJsyml\">DFlash<\/a> is a framework published in early 2026 that uses a small diffusion model as the drafter, enabling <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/decrypt.co\/370449\/xiaomi-mimo-ultraspeed-ai-model-faster-chatgpt-claude\" target=\"_blank\" class=\"sc-adb616fe-0 bJsyml\">over 6x speedup<\/a> on some tasks. It&#8217;s the engine that makes this class of model practical.<\/p>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">The problem: DiffusionGemma needs a specific drafter to run locally via MLX\u2014Apple&#8217;s machine learning framework for Apple Silicon. 
That module doesn&#8217;t exist in any public version of mlx-lm, in any open pull request, or in LM Studio&#8217;s bundled runtime.<\/p>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">We tried running DiffusionGemma with Hermes Agent through NVIDIA NIM. The model loaded, but initialization then failed with: &#8220;agent init failed: Model google\/diffusiongemma-26b-a4b-it has a context window of 8,192 tokens, which is below the minimum 64,000 required by Hermes Agent.&#8221;<\/p>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">To be precise: DiffusionGemma&#8217;s actual context window is 256K tokens. The 8,192 figure was the default in NVIDIA&#8217;s NIM deployment, not the model&#8217;s architectural limit.<\/p>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">In practice, getting it configured correctly for agentic use requires manual work that most everyday users haven&#8217;t figured out yet, and Hermes Agent simply won&#8217;t initialize without it. 
Parallel speed means nothing if the agent can&#8217;t boot.<\/p>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">Hopefully, in the next few days, the community will produce better resources to run these models.<\/p>\n<h2 class=\"sc-b2a202e4-2 bmropA gg-dark:text-white scene:font-itc-avant-garde-gothic-pro scene:font-light\" style=\"margin-top:2em;text-align:left;padding-bottom:16px;margin-bottom:16px;border-bottom:1px solid #dfe2e4\" color=\"#333\">Who this is actually for<\/h2>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">Developers with NVIDIA RTX 4090 or 5090 hardware building real-time tools\u2014inline editors, autocomplete, code infilling, structured generation. That&#8217;s the target. As Decrypt <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/decrypt.co\/367095\/google-make-local-ai-3x-faster-no-new-hardware\" target=\"_blank\" class=\"sc-adb616fe-0 bJsyml\">covered in May<\/a>, Google has been on a steady push to make local inference faster without new hardware.<\/p>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">For researchers, bidirectional generation opens territory that autoregressive models simply can&#8217;t reach\u2014protein sequences, mathematical graphs, anything where position N depends on position N+50. 
That&#8217;s not a small thing.<\/p>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">Google <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/decrypt.co\/363178\/google-gemma-4-open-source-ai\" target=\"_blank\" class=\"sc-adb616fe-0 bJsyml\">launched Gemma 4 under Apache 2.0<\/a> in April, and DiffusionGemma continues that strategy. There&#8217;s already a draft llama.cpp PR open as of today. When the toolchain catches up, this reaches a much wider audience.<\/p>\n<p class=\"font-meta-serif-pro scene:font-noto-sans scene:text-base scene:md:text-lg font-normal text-lg md:text-xl md:leading-9 tracking-px text-body gg-dark:text-neutral-100\">On a machine with a capable discrete GPU, 1,000 tokens per second is real.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>In brief Google released DiffusionGemma, a free open-weight model that generates entire 256-token blocks simultaneously via text diffusion\u2014hitting over 1,000 tokens per second on an NVIDIA H100, four times faster than standard 
autoregressive models. The custom drafter module DiffusionGemma needs for local inference doesn&#8217;t exist in any public runtime yet\u2014not in mlx-lm, not in LM [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":430264,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[230,225,221,227,226,228,229,60,223,224,222],"class_list":["post-430257","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cryptocurrency","tag-brave","tag-coinbase","tag-crypto","tag-decentralised","tag-decentralized","tag-decentralized-exchange","tag-erc-20","tag-featured","tag-meme-coin","tag-robinhood","tag-solana"],"_links":{"self":[{"href":"https:\/\/dripp.zone\/news\/wp-json\/wp\/v2\/posts\/430257","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dripp.zone\/news\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dripp.zone\/news\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dripp.zone\/news\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/dripp.zone\/news\/wp-json\/wp\/v2\/comments?post=430257"}],"version-history":[{"count":1,"href":"https:\/\/dripp.zone\/news\/wp-json\/wp\/v2\/posts\/430257\/revisions"}],"predecessor-version":[{"id":430268,"href":"https:\/\/dripp.zone\/news\/wp-json\/wp\/v2\/posts\/430257\/revisions\/430268"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dripp.zone\/news\/wp-json\/wp\/v2\/media\/430264"}],"wp:attachment":[{"href":"https:\/\/dripp.zone\/news\/wp-json\/wp\/v2\/media?parent=430257"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dripp.zone\/news\/wp-json\/wp\/v2\/categories?post=430257"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dripp.zone\/news\/wp-json\/wp\/v2\/tags?post=430257"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}