ThursdAI - The top AI news from the past week


Author: From Weights & Biases. Join AI Evangelist Alex Volkov and a panel of experts as they cover everything important that happened in the world of AI in the past week.


Description

Every ThursdAI, Alex Volkov hosts a panel of experts, AI engineers, data scientists, and prompt spellcasters on Twitter Spaces, as we discuss everything major and important that happened in the world of AI in the past week.

Topics include LLMs, open source, new capabilities, OpenAI, competitors in the AI space, new LLM models, AI art and diffusion, and much more.

sub.thursdai.news
77 Episodes
👋 Hey all, this is Alex, coming to you from very sunny California. I'm in SF again, while there's a complete snowstorm back home in Denver (brrr).

I flew here for the hackathon I kept telling you about, and it was glorious: over 400 registered, over 200 approved hackers, and 21 teams that submitted incredible projects 👏 You can follow some of these here.

I then decided to stick around and record the show from SF, finally pulled the plug and asked for some budget, and I present: the first ThursdAI recorded from the newly minted W&B podcast studio at our office in SF 🎉

This isn't the only first. Today, for the first time, all of the regular co-hosts of ThursdAI met on video. After over a year of hanging out weekly, we've finally made the switch to video, and you know what? Given how good AI podcasts are getting, we may have to stick with this video thing! We played one such clip from a new model called hertz-dev, a <10B model for full-duplex audio.

Given that today's episode is a video podcast, I would love for you to see it, so here are the timestamps for the chapters, which will be followed by the TL;DR and show notes in raw format. I would love to hear from folks who read the longer-form newsletters: do you miss them? Should I bring them back? Please leave me a comment 🙏 (I may send you a survey)

This was a generally slow week (for AI!! not for... ehrm, other stuff) and it was a fun podcast!
Leave me a comment about what you think of this new format.

Chapter Timestamps

00:00 Introduction and Agenda Overview
00:15 Open Source LLMs: Small Models
01:25 Open Source LLMs: Large Models
02:22 Big Companies and LLM Announcements
04:47 Hackathon Recap and Community Highlights
18:46 Technical Deep Dive: HertzDev and FishSpeech
33:11 Human in the Loop: AI Agents
36:24 Augmented Reality Lab Assistant
36:53 Hackathon Highlights and Community Vibes
37:17 Chef Puppet and Meta Ray Bans Raffle
37:46 Introducing Fester the Skeleton
38:37 Fester's Performance and Community Reactions
39:35 Technical Insights and Project Details
42:42 Big Companies API Updates
43:17 Haiku 3.5: Performance and Pricing
43:44 Comparing Haiku and Sonnet Models
51:32 XAI Grok: New Features and Pricing
57:23 OpenAI's O1 Model: Leaks and Expectations
01:08:42 Transformer ASIC: The Future of AI Hardware
01:13:18 The Future of Training and Inference Chips
01:13:52 Oasis Demo and Etched AI Controversy
01:14:37 Nisten's Skepticism on Etched AI
01:19:15 Human Layer Introduction with Dex
01:19:24 Building and Managing AI Agents
01:20:54 Challenges and Innovations in AI Agent Development
01:21:28 Human Layer's Vision and Future
01:36:34 Recap and Closing Remarks

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication.
To receive new posts and support my work, consider becoming a free or paid subscriber.

Show Notes and Links:

* Interview
* Dexter Horthy (X) from HumanLayer
* Open Source LLMs
* SmolLM2: the new, best, and open 1B-parameter language model (X)
* Meta released MobileLLM (125M, 350M, 600M, 1B) (HF)
* Tencent Hunyuan Large - 389B X 52B (Active) MoE (X, HF, Paper)
* Big CO LLMs + APIs
* OpenAI buys and opens chat.com
* Anthropic releases Claude Haiku 3.5 via API (X, Blog)
* OpenAI drops o1 full - and pulls it back (but not before it got jailbroken)
* X.ai now offers $25/mo free of Grok API credits (X, Platform)
* Etched announces Sohu - first Transformer ASIC - 500K tok/s (etched)
* PPXL is not valued at 9B lol
* This week's Buzz
* Recap of SF Hackathon w/ AI Tinkerers (X)
* Fester the Halloween Toy aka Project Halloweave videos from trick or treating (X, Writeup)
* Voice & Audio
* Hertz-dev - 8.5B conversation audio gen (X, Blog)
* Fish Agent v0.1 3B - Speech to Speech model (HF, Demo)
* AI Art & Diffusion & 3D
* FLUX 1.1 [pro] is now HD - 4x resolution (X, blog)

This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
Hey everyone, Happy Halloween! Alex here, coming to you live from my mad scientist lair! For the first-ever live video stream of ThursdAI, I dressed up as a mad scientist and had my co-host, Fester the AI-powered skeleton, join me (as well as my usual co-hosts, haha) in a very energetic and hopefully entertaining video stream!

Since it's Halloween today, Fester (and I) have a very busy schedule, so no full-length ThursdAI newsletter today; we're still not at the point where Gemini can write a decent draft that takes everything we talked about and covers all the breaking news, so I'm afraid I'll have to wish you a Happy Halloween and ask that you watch/listen to the episode. The TL;DR and show links from today don't cover all the breaking news, but the major things we saw today (and caught live on the show as Breaking News) were: ChatGPT now has search, and Gemini has grounded search as well (seems like OpenAI's streak of releasing something right before Google announces it continues). Here's a quick trailer of the major things that happened:

This week's buzz - Halloween AI toy with Weave

In this week's buzz, my long-awaited Halloween project is finally live and operational! I've posted a public Weave dashboard here and the code (that you can run on your Mac!) here. Really looking forward to seeing all the amazing costumes the kiddos come up with and how Gemini will respond to them. Follow along!

Ok, and finally, my raw TL;DR notes and links for this week.
Happy Halloween everyone, I'm running off to spook the kiddos (and of course record and post about it!)

ThursdAI - Oct 31 - TL;DR

TL;DR of all topics covered:

* Open Source LLMs:
* Microsoft's OmniParser: SOTA UI parsing (MIT Licensed) 𝕏
* Groundbreaking model for web automation (MIT license).
* State-of-the-art UI parsing and understanding.
* Outperforms GPT-4V in parsing web UI.
* Designed for web automation tasks.
* Can be integrated into various development workflows.
* ZhipuAI's GLM-4-Voice: End-to-end Chinese/English speech 𝕏
* End-to-end voice model for Chinese and English speech.
* Open-sourced and readily available.
* Focuses on direct speech understanding and generation.
* Potential applications in various speech-related tasks.
* Meta releases LongVU: Video LM for long videos 𝕏
* Handles long videos with impressive performance.
* Uses DINOv2 for downsampling, eliminating redundant scenes.
* Fuses features using DINOv2 and SigLIP.
* Select tokens are passed to Qwen2/Llama-3.2-3B.
* Demo and model are available on HuggingFace.
* Potential for significant advancements in video understanding.
* OpenAI new factuality benchmark (Blog, Github)
* Introducing SimpleQA: new factuality benchmark
* Goal: high correctness, diversity, challenging for frontier models
* Question Curation: AI trainers, verified by a second trainer
* Quality Assurance: 3% inherent error rate
* Topic Diversity: wide range of topics
* Grading Methodology: "correct", "incorrect", "not attempted"
* Model Comparison: smaller models answer fewer correctly
* Calibration Measurement: larger models more calibrated
* Limitations: only for short, fact-seeking queries
* Conclusion: drive research on trustworthy AI
* Big CO LLMs + APIs:
* ChatGPT now has Search! (X)
* Grounded search results in browsing the web
* Still hallucinates
* Reincarnation of SearchGPT inside ChatGPT
* Apple Intelligence Launch: Image features for iOS 18.2
* Officially launched for developers in iOS 18.2.
* Includes Image Playground and GenMoji.
* Aims to enhance image creation and manipulation on iPhones.
* GitHub Universe AI News: Copilot expands, new Spark tool 𝕏
* GitHub Copilot now supports Claude, Gemini, and OpenAI models.
* GitHub Spark: Create micro-apps using natural language.
* Expanding the capabilities of AI-powered coding tools.
* Copilot now supports multi-file edits in VS Code, similar to Cursor, and faster code reviews.
* GitHub Copilot extensions are planned for release in 2025.
* Grok Vision: Image understanding now in Grok 𝕏
* Finally has vision capabilities (currently via 𝕏, API coming soon).
* Can now understand and explain images, even jokes.
* Early version, with rapid improvements expected.
* OpenAI advanced voice mode updates (X)
* 70% cheaper in input tokens because of automatic caching (X)
* Advanced voice mode is now in the desktop app
* Claude this morning - new Mac/PC app
* This week's Buzz:
* My AI Halloween toy skeleton is greeting kids right now (and is reporting to a Weave dashboard)
* Vision & Video:
* Meta's LongVU: Video LM for long videos 𝕏 (see Open Source LLMs for details)
* Grok Vision on 𝕏: 𝕏 (see Big CO LLMs + APIs for details)
* Voice & Audio:
* MaskGCT: New SoTA Text-to-Speech 𝕏
* New open-source state-of-the-art text-to-speech model.
* Zero-shot voice cloning, emotional TTS, long-form synthesis, variable speed synthesis, bilingual (Chinese & English).
* Available on Hugging Face.
* ZhipuAI's GLM-4-Voice: End-to-end Chinese/English speech 𝕏 (see Open Source LLMs for details)
* Advanced Voice Mode on Desktops: 𝕏 (see Big CO LLMs + APIs for details).
* AI Art & Diffusion:
* Redcraft Red Panda: new SOTA image diffusion 𝕏
* High-performing image diffusion model, beating Black Forest Labs Flux.
* 72% win rate, higher ELO than competitors.
* Creates SVG files, editable as vector files.
* From Redcraft V3.
* Tools:
* Bolt.new by StackBlitz: In-browser full-stack dev environment 𝕏
* Platform for prompting, editing, running, and deploying full-stack apps directly in your browser.
* Uses WebContainers.
* Supports npm, Vite, Next.js, and integrations with Netlify, Cloudflare, and Supabase.
* Free to use.
* Jina AI's Meta-Prompt: Improved LLM Codegen 𝕏
Hey all, Alex here, coming to you from (surprisingly) sunny Seattle, with just a mind-boggling week of releases. Really, just on Tuesday there was so much news already that I had to post a recap thread, something I usually do only after I finish ThursdAI! From Anthropic reclaiming the close-second, sometimes-first AI lab position and giving Claude the wheel in the form of computer-use powers, to more than 3 AI video generation updates (including open source ones), to Apple updating the Apple Intelligence beta, it's honestly been very hard to keep up, and again, this is literally part of my job! But once again I'm glad we were able to cover this in ~2hrs, including multiple interviews with returning co-hosts (Simon Willison came back, Killian came back), so if you're only a reader at this point, definitely listen to the show!

Ok, as always (recently), the TL;DR and show notes are at the bottom (I'm trying to get you to scroll through, ha, is it working?), so grab a bucket of popcorn and let's dive in 👇

Claude's Big Week: Computer Control, Code Wizardry, and the Mysterious Case of the Missing Opus

Anthropic dominated the headlines this week with a flurry of updates and announcements. Let's start with the new Claude Sonnet 3.5 (really, they didn't update the version number; it's still 3.5, though it's a different API model).

Claude Sonnet 3.5: Coding Prodigy or Benchmark Buster?

The new Sonnet model shows impressive results on coding benchmarks, surpassing even OpenAI's o1-preview on some. "It absolutely crushes coding benchmarks like Aider and SWE-bench Verified," I exclaimed on the show. But a closer look reveals a more nuanced picture. Mixed results on other benchmarks indicate that Sonnet 3.5 might not be the universal champion some anticipated.
A friend who maintains private internal benchmarks was disappointed, highlighting weaknesses in scientific reasoning and certain writing tasks. Some folks are seeing it be lazier on some full code completions, while the maximum output is now doubled from 4K to 8K tokens! This goes to show, again, that benchmarks don't tell the full story, so we wait for LMArena (formerly LMSys Arena) and the vibe checks from across the community. However, it absolutely dominates in code tasks; that much is clear already. This is a screenshot of the new model on the Aider code editing benchmark, a fairly reliable way to judge models' code output; they also have a code refactoring benchmark.

Haiku 3.5 and the Vanishing Opus: Anthropic's Cryptic Clues

Further adding to the intrigue, Anthropic announced Claude 3.5 Haiku! They usually provide immediate access, but Haiku remains elusive; they say it will be available by the end of the month, which is very, very soon. Making things even more curious, their highly anticipated Opus model has seemingly vanished from their website. "They've gone completely silent on 3.5 Opus," Simon Willison (𝕏) noted, mentioning conspiracy theories that this new Sonnet might simply be a rebranded Opus? 🕯️ 🕯️ We'll make a summoning circle for the new Opus and update you once it lands (maybe next year).

Claude Takes Control (Sort Of): Computer Use API and the Dawn of AI Agents (𝕏)

The biggest bombshell this week? Anthropic's Computer Use. This isn't just about executing code; it's about Claude interacting with computers: clicking buttons, browsing the web, and yes, even ordering pizza! Killian Lukas (𝕏), creator of Open Interpreter, returned to ThursdAI to discuss this groundbreaking development. "This stuff of computer use… it's the same argument for having humanoid robots, the web is human shaped, and we need AIs to interact with computers and the web the way humans do," Killian explained, illuminating the potential for bridging the digital and physical worlds.
Simon, though enthusiastic, provided a dose of realism: "It's incredibly impressive… but also very much a V1, beta." Having tackled the setup myself, I agree; the current reliance on a local Docker container and virtual machine introduces some complexity and security considerations. However, seeing Claude fix its own Docker installation error was an unforgettably mind-blowing experience. The future of AI agents is upon us, even if it's still a bit rough around the edges. Here's an easy guide to set it up yourself; it takes 5 minutes, requires no coding skills, and is safely tucked away in a container.

Big Tech's AI Moves: Apple Embraces ChatGPT, X.ai API (+Vision!?), and Cohere Multimodal Embeddings

The rest of the AI world wasn't standing still. Apple made a surprising integration, while X.ai and Cohere pushed their platforms forward.

Apple iOS 18.2 Beta: Siri Phones a Friend (ChatGPT)

Apple, always cautious, surprisingly integrated ChatGPT directly into iOS. While Siri remains… well, Siri, users can now effortlessly offload more demanding tasks to ChatGPT. "Siri is still stupid," I joked, "but you can now ask it to write some stuff and it'll tell you, hey, do you want me to ask my much smarter friend ChatGPT about this task?" This approach acknowledges Siri's limitations while harnessing ChatGPT's power. The iOS 18.2 beta also includes GenMoji (custom emojis!) and Visual Intelligence (multimodal camera search), which are both welcome, though I don't really get the need for Visual Intelligence (maybe I'm jaded with my Meta Ray-Bans, which already have this and are on my face most of the time), and I still haven't gotten off the GenMoji waitlist, so I'm still waiting to show you some custom emojis!

X.ai API: Grok's Enterprise Ambitions and a Secret Vision Model

Elon Musk's X.ai unveiled their API platform, focusing on enterprise applications with the Grok 2 beta. They also teased an undisclosed vision model, and they had vision APIs for some folks who joined their hackathon.
While these models are not necessarily worth using yet, the next Grok 3 is promising to be a frontier model, and its relaxed approach to content moderation (what Elon calls maximally truth-seeking) is going to be a convincing point for some! I just wish they added fun mode and access to real-time data from X! Right now it's just the Grok 2 model, priced at a very noncompetitive $15/mTok 😒

Cohere Embed 3: Elevating Multimodal Embeddings (Blog)

Cohere launched Embed 3, enabling embeddings for both text and visuals such as graphs and designs. "While not the first multimodal embeddings, when it comes from Cohere, you know it's done right," I commented.

Open Source Power: JavaScript Transformers and SOTA Multilingual Models

The open-source AI community continues to impress, making powerful models accessible to all. Massive kudos to Xenova (𝕏) for the release of Transformers.js v3! The addition of WebGPU support results in a staggering "up to 100 times faster" performance boost for browser-based AI, dramatically simplifying local, private, and efficient model running. We also saw DeepSeek's Janus 1.3B, a multimodal image and text model, and Cohere For AI's Aya Expanse, supporting 23 languages.

This Week's Buzz: Hackathon Triumphs and Multimodal Weave

On ThursdAI, we also like to share some of the exciting things happening behind the scenes.

AI Chef Showdown: Second Place and Lessons Learned

Happy to report that team Yes Chef clinched second place in a hackathon with an unconventional creation: a Gordon Ramsay-inspired robotic chef hand puppet, complete with a cloned voice and visual LLM integration. We bought, 3D printed, and assembled an open-source robotic arm, made it a ventriloquist operator by letting it animate a hand puppet, and cloned Ramsay's voice.
It was so, so much fun to build, and the code is here.

Weave Goes Multimodal: Seeing and Hearing Your AI

Even more exciting was the opportunity to leverage Weave's newly launched multimodal functionality. "Weave supports you to see and play back everything that's audio generated," I shared, emphasizing its usefulness in debugging our vocal AI chef. For a practical example, here are ALL the (NSFW) roasts that AI Chef has cooked me with; it's honestly horrifying, haha. For full effect, turn on the background music first and then play the chef audio 😂📽️

Video Generation Takes Center Stage: Mochi's Motion Magic and Runway's Acting Breakthrough

Video models made a quantum leap this week, pushing the boundaries of generative AI.

Genmo Mochi-1: Diffusion Transformers and Generative Motion

Genmo's Ajay Jain (Genmo) joined ThursdAI to discuss Mochi-1, their powerful new diffusion transformer. "We really focused on… prompt adherence and motion," he explained. Mochi-1's capacity to generate complex and realistic motion is truly remarkable, and with an HD version on its way, the future looks bright (and animated!). They also get bonus points for dropping a torrent link in the announcement tweet. This Apache 2.0, 10B diffusion transformer is open source, but not for the GPU-poors, as it requires 4 GPUs to run; apparently, though, there was already an attempt to run it on a single 4090, which Ajay highlighted as one of the reasons they open-sourced it!

Runway Act-One: AI-Powered Puppetry and the Future of Acting (blog)

Ok, this one absolutely seems bonkers! Runway unveiled Act-One! Forget just generating video from text; Act-One takes a driving video and a character image to produce expressive and nuanced character performances.
"It faithfully represents elements like eye-lines, micro expressions, pacing, and delivery," I noted, excited by the transformative potential for animation and filmmaking. So no need for rigging, or for motion-capture rigs on actors' faces; Runway now does this, so you can generate characters with Flux and animate them with Act-One 📽️ Just take a look at this insanity 👇

11labs Creative Voices: Prompting Your Way to the Perfect Voice

11labs debuted an incredible feature: creating custom voices using only text prompts. Want a high-pitched squeak or a sophisticated British accent? Just ask.
Hey folks, Alex here from Weights & Biases, and this week has been absolutely bonkers. From robots walking among us to rockets landing on chopsticks (well, almost), the future is feeling palpably closer. And if real-world robots and reusable spaceship boosters weren't enough, the open-source AI community has been cooking, dropping new models and techniques faster than a Starship launch. So buckle up, grab your space helmet and noise-canceling headphones (we'll get to why those are important!), and let's blast off into this week's AI adventures!

TL;DR and show-notes + links at the end of the post 👇

Robots and Rockets: A Glimpse into the Future

I gotta start with the real-world stuff because, let's be honest, it's mind-blowing. We had Robert Scoble (yes, the Robert Scoble) join us after attending the Tesla We, Robot AI event, reporting on Optimus robots strolling through crowds, serving drinks, and generally being ridiculously futuristic. Autonomous robo-taxis were also cruising around, giving us a taste of a driverless future. Robert's enthusiasm was infectious: "It was a vision of the future, and from that standpoint, it succeeded wonderfully." I couldn't agree more. While the market might have had a mini-meltdown (apparently investors aren't ready for robot butlers yet), the sheer audacity of Tesla's vision is exhilarating. These robots aren't just cool gadgets; they represent a fundamental shift in how we interact with technology and the world around us. And they're learning fast. Just days after the event, Tesla released a video of Optimus operating autonomously, showcasing the rapid progress they're making.

And speaking of audacious visions, SpaceX decided to one-up everyone (including themselves) by launching Starship and catching the booster with Mechazilla, their giant robotic chopsticks (okay, technically a launch tower, but you get the picture). Waking up early with my daughter to watch this live was pure magic.
As Ryan Carson put it, "It was magical watching this… my kid who's 16… all of his friends are getting their imaginations lit by this experience." That's exactly what we need: more imagination and less doomerism! The future is coming whether we like it or not, and I, for one, am excited.

Open Source LLMs and Tools: The Community Delivers (Again!)

Okay, back to the virtual world (for now). This week's open-source scene was electric, with new model releases and tools that have everyone buzzing (and benchmarking like crazy!).

* Nemotron 70B: Hype vs. Reality: NVIDIA dropped their Nemotron 70B instruct model, claiming impressive scores on certain benchmarks (Arena Hard, AlpacaEval), even suggesting it outperforms GPT-4 and Claude 3.5. As always, we take these claims with a grain of salt (remember Reflection?), and our resident expert, Nisten, was quick to run his own tests. The verdict? Nemotron is good, "a pretty good model to use," but maybe not the giant-killer some hyped it up to be. Still, kudos to NVIDIA for pushing the open-source boundaries. (Hugging Face, Harrison Kingsley evals)
* Zamba 2: Hybrid Vigor: Zyphra, in collaboration with NVIDIA, released Zamba 2, a hybrid sparse Mixture of Experts (MoE) model. We had Paolo Glorioso, a researcher from Zyphra, join us to break down this unique architecture, which combines the strengths of transformers and state space models (SSMs). He highlighted the memory and latency advantages of SSMs, especially for on-device applications. Definitely worth checking out if you're interested in transformer alternatives and efficient inference.
* Zyda 2: Data is King (and Queen): Alongside Zamba 2, Zyphra also dropped Zyda 2, a massive 5 trillion token dataset, filtered, deduplicated, and ready for LLM training. This kind of open-source data release is a huge boon to the community, fueling the next generation of models. (X)
* Ministral: Pocket-Sized Power: On the one-year anniversary of the iconic Mistral 7B release, Mistral announced two new smaller models: Ministral 3B and 8B. Designed for on-device inference, these models are impressive, but as always, Qwen looms large. While Mistral didn't include Qwen in their comparisons, early tests suggest Qwen's smaller models still hold their own. One point of contention: these Ministrals aren't as open source as the original 7B, which is a bit of a bummer, with the 3B not even being released anywhere besides their platform. (Mistral Blog)
* Entropix (aka Shrek Sampler): Thinking Outside the (Sample) Box: This one is intriguing! Entropix introduces a novel sampling technique aimed at boosting the reasoning capabilities of smaller LLMs. Nisten's yogurt analogy explains it best: it's about "marinating" the information and picking the best "flavor" (token) at the end. Early examples look promising, suggesting Entropix could help smaller models tackle problems that even trip up their larger counterparts. But, as with all shiny new AI toys, we're eagerly awaiting robust evals. Tim Kellogg has a detailed breakdown of this method here.
* Gemma-APS: Fact-Finding Mission: Google released Gemma-APS, a set of models specifically designed for extracting claims and facts from text. While LLMs can already do this to some extent, a dedicated model for this task is definitely interesting, especially for applications requiring precise information retrieval. (HF)

🔥 OpenAI adds voice to their completion API (X, Docs)

In the last second of the pod, OpenAI decided to grace us with breaking news! Not only did they launch their native Windows app, but they also added voice input and output to their completion APIs. This seems to be the same model as advanced voice mode (and priced super expensively as well), and the one they used in the Realtime API released a few weeks ago at DevDay.
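As a rough illustration of what calling this audio-capable completions endpoint looks like, here's a minimal Python sketch. The model name, field names, and voice below are my assumptions based on the announcement, not verified documentation, so double-check against OpenAI's docs before relying on them:

```python
import base64
import json

# Assumed model id for the audio-capable completions endpoint (hypothetical here).
AUDIO_MODEL = "gpt-4o-audio-preview"

def build_audio_request(prompt: str, voice: str = "alloy") -> dict:
    """Build a JSON body asking the omni model to reply with both text and audio."""
    return {
        "model": AUDIO_MODEL,
        "modalities": ["text", "audio"],           # request a spoken reply too
        "audio": {"voice": voice, "format": "wav"},  # assumed audio options
        "messages": [{"role": "user", "content": prompt}],
    }

def decode_audio_reply(response: dict) -> bytes:
    """The audio presumably comes back base64-encoded inside the message; decode to raw bytes."""
    b64 = response["choices"][0]["message"]["audio"]["data"]
    return base64.b64decode(b64)

if __name__ == "__main__":
    body = build_audio_request("Count to 10 super slow")
    print(json.dumps(body, indent=2))
```

You'd POST this body to the chat completions endpoint with your API key in the `Authorization` header, then write the decoded bytes out as a `.wav` file.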
This is, of course, a bit slower than the Realtime API, but it's much simpler to use and gives way more developers access to this incredible resource (I'm definitely planning to use this for... things 😈). This isn't their TTS or STT (Whisper) model; no, this is an actual omni model that understands audio natively and also outputs audio natively, allowing for things like "count to 10 super slow." I've played with it just now (and now it's after 6pm and I'm still writing this newsletter) and it's so, so awesome. I expect it to be huge, because the Realtime API is very cumbersome and many people don't really need that complexity.

This week's Buzz - Weights & Biases updates

Ok, I wanted to send a completely different update, but what I will show you is: Weave, our observability framework, is now also multimodal! This couples very well with the new update from OpenAI! So here's an example usage with today's announcement: I'm going to go through the OpenAI example and show you how to use it with streaming so you can get the audio faster, and show you the Weave multimodality as well 👇 You can find the code for this in this Gist, and please give us feedback, as this is brand new.

Non-standard use-cases of AI corner

This week I started noticing and collecting some incredible use-cases of Gemini, its long context, and its multimodality, and I wanted to share them with you, so we had some incredible conversations about non-standard use cases that are pushing the boundaries of what's possible with LLMs. Hrishi blew me away with his experiments using Gemini for transcription and diarization. Turns out, Gemini is not only great at transcription (it beats Whisper!), it's also ridiculously cheaper than dedicated ASR models like Whisper, around 60x cheaper! He emphasized the unexplored potential of prompting multimodal models, adding, "the prompting on these things… is still poorly understood."
So much room for innovation here! Simon Willison then stole the show with his mind-bending screen-scraping technique. He recorded a video of himself clicking through emails, fed it to Gemini Flash, and got perfect structured data in return. This trick isn't just clever; it's practically free, thanks to the ridiculously low cost of Gemini Flash. I even tried it myself, recording my X bookmarks and getting a near-perfect TL;DR of the week's AI news. The future of data extraction is here, and it involves screen recordings and very cheap (or free) LLMs. Here's Simon's example of how much this would have cost him had he actually been charged for it. 🤯

Speaking of Simon Willison, he broke the news that NotebookLM has gotten an upgrade, with the ability to steer the speakers with custom commands, which Simon promptly used to ask the overview hosts to talk like pelicans.

Voice Cloning, Adobe Magic, and the Quest for Real-Time Avatars

Voice cloning also took center stage this week, with the release of F5-TTS. This open-source model performs zero-shot voice cloning with just a few seconds of audio, raising all sorts of ethical questions (and exciting possibilities!). I played a sample on the show, and it was surprisingly convincing (though not without its problems) for a local model! This, combined with Hallo 2's (also released this week!) ability to animate talking avatars, has Wolfram Ravenwolf dreaming of real-time AI assistants with personalized faces and voices. The pieces are falling into place, folks.

And for all you Adobe fans, Firefly Video has landed! This "commercially safe" text-to-video and image-to-video model is seamlessly integrated into Premiere, offering incredible features like extending video clips with AI-generated frames. Photoshop also got some Firefly love, with mind-bending relighting capabilities that could make AI-generated images indistinguishable from real photographs.

Wrapping Up:

Phew, that was a marathon, not a sprint!
From robots to rockets, open source to proprietary, and voice cloning to video editing, this week has been a wild ride through the ever-evolving landscape of AI. Thanks for joining me on this adventure, and as always, keep exploring, keep building, and keep pushing those AI boundaries. The future is coming, and it’s going to be amazing.P.S. Don’t forget to subscribe to the podcast and newsletter for more
Hey Folks, we are finally due for a "relaxing" week in AI, no more HUGE company announcements (if you don't consider Meta Movie Gen huge), no conferences or dev days, and some time for Open Source projects to shine. (while we all wait for Opus 3.5 to shake things up) This week was very multimodal on the show, we covered 2 new video models, one that's tiny and is open source, and one massive from Meta that is aiming for SORA's crown, and 2 new VLMs, one from our friends at REKA that understands videos and audio, while the other from Rhymes is apache 2 licensed and we had a chat with Kwindla Kramer about OpenAI RealTime API and it's shortcomings and voice AI's in general. ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.All right, let's TL;DR and show notes, and we'll start with the 2 Nobel prizes in AI 👇 * 2 AI nobel prizes* John Hopfield and Geoffrey Hinton have been awarded a Physics Nobel prize* Demis Hassabis, John Jumper & David Baker, have been awarded this year's #NobelPrize in Chemistry.* Open Source LLMs & VLMs* TxT360: a globally deduplicated dataset for LLM pre-training ( Blog, Dataset)* Rhymes Aria - 25.3B multimodal MoE model that can take image/video inputs Apache 2 (Blog, HF, Try It)* Maitrix and LLM360 launch a new decentralized arena (Leaderboard, Blog)* New Gradio 5 with server side rendering (X)* LLamaFile now comes with a chat interface and syntax highlighting (X)* Big CO LLMs + APIs* OpenAI releases MLEBench - new kaggle focused benchmarks for AI Agents (Paper, Github)* Inflection is still alive - going for enterprise lol (Blog)* new Reka Flash 21B - (X, Blog, Try It)* This weeks Buzz* We chatted about Cursor, it went viral, there are many tips* WandB releases HEMM - benchmarks of text-to-image generation models (X, Github, Leaderboard)* Vision & Video* Meta presents Movie Gen 30B - img and text to video models 
(blog, paper)* Pyramid Flow - open source img2video model, MIT license (X, Blog, HF, Paper, Github)* Voice & Audio* Working with OpenAI RealTime Audio - Alex's conversation with Kwindla from trydaily.com* Cartesia Sonic goes multilingual (X)* Voice hackathon in SF with 20K prizes (and a remote track) - sign up* Tools* LM Studio ships with MLX natively (X, Download)* UITHUB.com - turn any GitHub repo into 1 long file for LLMs. A Historic Week: TWO AI Nobel Prizes! This week wasn't just big; it was HISTORIC. As Yam put it, "two Nobel prizes for AI in a single week. It's historic." And he's absolutely spot on! Geoffrey Hinton, often called the "godfather of modern AI," and John Hopfield were awarded the Nobel Prize in Physics for their foundational work on neural networks - work that paved the way for everything we're seeing today. Think backpropagation, Boltzmann machines - these are concepts that underpin much of modern deep learning. It's about time they got the recognition they deserve! Yoshua Bengio posted about this in a very nice quote: @HopfieldJohn and @geoffreyhinton, along with collaborators, have created a beautiful and insightful bridge between physics and AI. They invented neural networks that were not only inspired by the brain, but also by central notions in physics such as energy, temperature, system dynamics, energy barriers, the role of randomness and noise, connecting the local properties, e.g., of atoms or neurons, to global ones like entropy and attractors. And they went beyond the physics to show how these ideas could give rise to memory, learning and generative models; concepts which are still at the forefront of modern AI research. And Hinton's post-Nobel quote? Pure gold: "I'm particularly proud of the fact that one of my students fired Sam Altman." He went on to explain his concerns about OpenAI's apparent shift in focus from safety to profits. Spicy take!
It sparked quite a conversation about the ethical implications of AI development and who's responsible for ensuring its safe deployment. It's a discussion we need to be having more and more as the technology evolves. Can you guess which one of his students it was? Then, not to be outdone, Demis Hassabis and John Jumper of the AlphaFold team, together with David Baker (recognized for computational protein design), snagged the Nobel Prize in Chemistry. AlphaFold 2 revolutionized protein folding, accelerating drug discovery and biomedical research in a way no one thought possible. These awards highlight the tangible, real-world applications of AI. It's not just theoretical anymore; it's transforming industries. Congratulations to all winners, and we gotta wonder, is this the start of a trend of AI taking over every Nobel prize going forward? 🤔 Open Source LLMs & VLMs: The Community is COOKING! The open-source AI community consistently punches above its weight, and this week was no exception. We saw some truly impressive releases that deserve a standing ovation. First off, the TxT360 dataset (blog, dataset). Nisten, our resident technical expert, broke down the immense effort: "The amount of DevOps and…operations to do this work is pretty rough." This globally deduplicated 15+ trillion-token corpus combines the best of Common Crawl with a curated selection of high-quality sources, setting a new standard for open-source LLM training. We talked about the importance of deduplication for model training - avoiding the "memorization" of repeated information that can skew a model's understanding of language. TxT360 takes a 360-degree approach to data quality and documentation - a huge win for accessibility. Apache 2 Multimodal MoE from Rhymes AI called Aria (Blog, HF, Try It) Next, the Rhymes Aria model (25.3B total and only 3.9B active parameters!). This multimodal marvel operates as a Mixture of Experts (MoE), meaning it activates only the necessary parts of its vast network for a given task, making it surprisingly efficient.
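To make that routing idea concrete, here's a toy, purely illustrative sketch of top-k expert routing. The scalar "experts" and router weights below are made up for the example; a real MoE layer like Aria's routes hidden-state vectors through full neural-network experts, but the selection logic is the same in spirit:

```python
# Toy sketch of Mixture-of-Experts routing (illustrative only, not Aria's code).
# A router scores every expert per token; only the top-k experts actually run,
# which is how total parameters can far exceed active parameters.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, router_weights, k=2):
    """Route a scalar `token` through only the top-k scoring experts."""
    scores = softmax([w * token for w in router_weights])
    # pick the k highest-scoring experts
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    # renormalize the gate weights over just the selected experts
    gate_sum = sum(scores[i] for i in top)
    return sum(scores[i] / gate_sum * experts[i](token) for i in top)

# 4 toy "experts" (each just a scalar function); only 2 run for this token
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
out = moe_forward(3.0, experts, router_weights=[0.9, 0.1, -0.5, 0.3], k=2)
```

The point: with k=2 of 4 experts active, only a fraction of the expert parameters do work per token, which is roughly how a 25.3B-total model can run with only ~3.9B active parameters.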
Aria excels in understanding image and video inputs, features a generous 64K token context window, and is available under the Apache 2 license - music to open-source developers' ears! We even discussed its coding capabilities: imagine pasting images of code and getting intelligent responses. I particularly love the focus on long multimodal input understanding (think longer videos) and super-high-resolution image support. I uploaded this simple pin-out diagram of a Raspberry Pi and it got all the answers right! Including ones I missed myself (and it won against Gemini 002 and the new Reka Flash!) Big Companies and APIs: OpenAI's new agentic benchmark - can it compete with MLEs on Kaggle? OpenAI snuck in a new benchmark, MLEBench (Paper, Github), specifically designed to evaluate AI agents' performance on Machine Learning Engineering tasks. It's designed around a curated collection of Kaggle competitions, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. They found that the best-performing setup (OpenAI's o1-preview with AIDE scaffolding) achieves at least the level of a Kaggle bronze medal in 16.9% of competitions (though there are some that throw shade on this score). Meta comes for our reality with Movie Gen. But let's be honest, Meta stole the show this week with Movie Gen (blog). This isn't your average video generation model; it's like something straight out of science fiction. Imagine creating long, high-definition videos, with different aspect ratios, personalized elements, and accompanying audio - all from text and image prompts. It's like the Holodeck is finally within reach! Unfortunately, despite hinting at its size (30B), Meta is not releasing this model (just yet), nor is it widely available so far! But we'll keep our fingers crossed that it drops before SORA.
One super notable thing: this model also generates audio to accompany the video, and it's quite striking. We listened to a few examples from Meta's demo, and the sound effects were truly remarkable - everything from fireworks to rustling leaves. This model isn't just creating video, it's crafting experiences. (Sound on for the next example!) They also have personalization built in, showcased here by one of the Llama leads, Roshan, as a scientist doing experiments; the realism is quite awesome to see (but I get why they are afraid of releasing this in open weights). This Week's Buzz: What I learned at Weights & Biases this week. My "buzz" this week was less about groundbreaking models and more about mastering the AI tools we have. We had a team meeting to share our best tips and tricks for using Cursor, and when I shared those insights on X (thread), they went surprisingly viral! The big takeaway from the thread? Composer, Cursor's latest feature, is a true game-changer. It allows for more complex refactoring and code generation across multiple files - the kind of stuff that would take hours manually. If you haven't tried Composer, you're seriously missing out. We also covered strategies for leveraging different models for specific tasks, like using o1-mini for outlining and then switching to the more robust Claude 3.5 for generating code. Another gem we uncovered: selecting any text in the console and hitting opt+D will immediately send it to the chat to debug, super useful! Over at Weights & Biases, my talented teammate, Soumik, released HEMM (X, Github), a comprehensive benchmark specifically designed for text-to-image generation models. Want to know how different models fare on image quality and prompt comprehension? Head over to the leaderboard on Weave (Leaderboard) and find out!
And yes, it's true, Weave, our LLM observability tool, is multimodal (well within the theme of today's update). Voice and Audio: Real-Time Conversations and the Quest for Affordable AI. OpenAI's DevDay was just a few weeks back, but the ripple effects of their announcements are still being felt. The big one for voice AI enthusiasts like myself? The RealTime API, offering
Hey, it's Alex. Ok, so mind is officially blown. I was sure this week was going to be wild, but I didn't expect everyone else besides OpenAI to pile on, exactly on ThursdAI. Coming back from Dev Day (number 2), I am still processing, and wanted to actually do a recap by humans, not just the NotebookLM one I posted during the keynote itself (which was awesome and scary in a "will AI replace me as a podcaster" kind of way), and it was incredible to have Simon Willison, who was sitting just behind me most of Dev Day, join me for the recap! But then the news kept coming: OpenAI released Canvas, which is a whole new way of interacting with chatGPT, BFL released a new Flux version that's 8x faster, Rev released a Whisper-beating ASR that does diarization, and Google released Gemini 1.5 Flash 8B, and said that with prompt caching (which OpenAI now also has, yay) this will cost a whopping 0.01 / Mtok. That's 1 cent per million tokens, for a multimodal model with a 1-million-token context window. 🤯 This whole week was crazy, as last ThursdAI after finishing the newsletter I went to meet tons of folks at the AI Tinkerers in Seattle, and did a little EvalForge demo (which you can see here) and wanted to share EvalForge with you as well; it's early but very promising, so feedback and PRs are welcome! WHAT A WEEK, TL;DR for those who want the links, and let's dive in 👇 * OpenAI - Dev Day Recap (Alex, Simon Willison)* Recap of Dev Day* RealTime API launched* Prompt Caching launched* Model Distillation is the new finetune* Finetuning 4o with images (Skalski guide)* Fireside chat Q&A with Sam* Open Source LLMs * NVIDIA finally releases NVLM (HF)* This week's Buzz* Alex discussed his demo of EvalForge at the AI Tinkerers event in Seattle in "This Week's Buzz".
(Demo, EvalForge, AI Tinkerers)* Big Companies & APIs* Google has released Gemini Flash 8B - 0.01 per million tokens cached (X, Blog)* Voice & Audio* Rev breaks SOTA on ASR with Rev ASR and Rev Diarize (Blog, Github, HF)* AI Art & Diffusion & 3D* BFL releases Flux1.1[pro] - 3x-6x faster than 1.0 and higher quality (was 🫐) - (Blog, Try it). The day I met Sam Altman / Dev Day recap. Last Dev Day (my coverage here) was a "singular" day in AI for me, given it also had the "keep AI open source" event with Nous Research and Grimes, and this Dev Day I was delighted to find out that the vibe was completely different, focused less on bombastic announcements or models and more on practical, dev-focused things. This meant that OpenAI cherry-picked folks who actively develop with their tools, and they didn't invite traditional media, only folks like yours truly, @swyx from Latent Space, Rowan from Rundown, Simon Willison and Dan Shipper, you know, newsletter and podcast folks who actually build! This also meant that many, many OpenAI employees who work on the products and APIs we get to use were there to receive feedback, help folks with prompting, and just generally interact with the devs and build that community. I want to shoutout my friends Ilan (who was in the keynote as the strawberry salesman interacting with the RealTime API agent), Will DePue from the SORA team, with whom we had an incredible conversation about ethics and legality of projects, Christine McLeavey who runs the Audio team, with whom I shared a video of my daughter crying when chatGPT didn't understand her, Katia, Kevin and Romain on the incredible DevEx/DevRel team and finally, my new buddy Jason who does infra, and was fighting bugs all day and only joined the pub after shipping RealTime to all of us.
I've collected all these folks in a convenient and super high signal X list here, so definitely give that list a follow if you'd like to tap into their streams. For the actual announcements, I've already covered this in my Dev Day post here (which was paid-subscribers only, but is now open to all) and Simon did an incredible summary on his Substack as well. The highlights were definitely the new RealTime API that lets developers build with Advanced Voice Mode, Prompt Caching that will happen automatically and reduce the cost of all your long-context API calls by a whopping 50%, and model finetuning, which they are rebranding into Distillation, with new tools to make it easier (including Vision Finetuning for the first time!). Meeting Sam Altman. While I didn't get a "media" pass or anything like this, and didn't really get to sit down with OpenAI execs (see Swyx on Latent Space for those conversations), I did have a chance to ask Sam multiple things. First, at the closing fireside chat between Sam and Kevin Weil (CPO at OpenAI), Kevin first asked Sam a bunch of questions, and then they gave out the microphones to folks, and I asked the only question that got Sam to smile. Sam and Kevin went on for a while, and that Q&A was actually very interesting, so much so that I had to recruit my favorite NotebookLM podcast hosts to go through it and give you an overview, so here's that NotebookLM, with the transcript of the whole Q&A (maybe I'll publish it as a standalone episode? LMK in the comments). After the official day was over, there was a reception at the same gorgeous Fort Mason location, with drinks and light food, and as you might imagine, this was great for networking. But the real post-Dev-Day event was hosted by OpenAI devs at a bar, Palm House, which both Sam and Greg Brockman just came to and hung out with folks.
I missed Sam last time and was very eager to go and ask him follow-up questions this time, when I saw he was just chilling at that bar, talking to devs, as though he didn't "just" complete the largest funding round in VC history ($6.6B at $157B valuation) and go through a lot of drama/turmoil with the departure of a lot of senior leadership! Sam was awesome to briefly chat with, tho as you might imagine, it was loud and tons of folks wanted selfies, but we did discuss how AI affects the real world, job replacement stuff was brought up, and how developers are using the OpenAI products. What we learned, thanks to Sigil, is that o1 was named partly as a "reset" like the main blogpost claimed and partly as "alien of extraordinary ability", which is the official designation of the O-1 visa, and that Sam came up with this joke himself. Is anyone here smarter than o1? Do you think you still will be by o2? One of the highest impact questions was by Sam himself to the audience. Who feels like they've spent a lot of time with O1, and they would say, like, I feel definitively smarter than that thing?— Sam Altman. When Sam asked this at first, a few hands hesitantly went up. He then followed up with "Do you think you still will by O2?" No one. No one taking the bet. One of the challenges that we face is like, we know how to go do this thing that we think will be like, at least probably smarter than all of us in like a broad array of tasks. This was a very palpable moment: folks looked around and realized, what OpenAI folks have probably internalized a long time ago, that we're living in INSANE times, and even those of us at the frontier of research, AI use and development don't necessarily understand or internalize how WILD the upcoming few months and years will be. And then we all promptly forgot to have an existential crisis about it, and took our self-driving Waymos to meet Sam Altman at a bar 😂 This week's Buzz from Weights & Biases. Hey so...
after finishing ThursdAI last week I went to the Seattle Tinkerers event and gave a demo (and sponsored the event with a raffle of Meta Raybans). I demoed our project called EvalForge, which I built the frontend of while my colleague Anish built the backend, as we tried to replicate the "Who Validates the Validators" paper by Shreya Shankar; here's that demo, and the EvalForge Github for the many of you who asked to see it. Please let me know what you think, I love doing demos and would love feedback and ideas for the next one (coming up in October!). OpenAI chatGPT Canvas - a completely new way to interact with chatGPT. Just 2 days after Dev Day, and as breaking news during the show, OpenAI also shipped a new way to interact with chatGPT, called Canvas! Get ready to say goodbye to simple chats and hello to a whole new era of AI collaboration! Canvas is a groundbreaking interface that transforms ChatGPT into a true creative partner for writing and coding projects. Imagine having a tireless copy editor, a brilliant code reviewer, and an endless source of inspiration all rolled into one - that's Canvas! Canvas moves beyond the limitations of a simple chat window, offering a dedicated space where you and ChatGPT can work side-by-side. Canvas opens in a separate window, allowing for a more visual and interactive workflow. You can directly edit text or code within Canvas, highlight sections for specific feedback, and even use a handy menu of shortcuts to request tasks like adjusting the length of your writing, debugging code, or adding final polish. And just like with your favorite design tools, you can easily restore previous versions using the back button. Per Karina, OpenAI has trained a special GPT-4o model specifically for Canvas, enabling it to understand the context of your project and provide more insightful assistance. They used synthetic data generated by o1, which led it to outperform the basic version of GPT-4o by 30% in accuracy.
A general pattern emerges, where new frontiers in intelligence also advance older models (and humans as well). Gemini Flash 8B makes intelligence essentially free. Google folks were not about to take this week lightly and decided to hit back with one of the most insane upgrades to pricing I've seen. The newly announced Gemini Flash 1.5 8B is going to cost just... $0.01 per million tokens 🤯 (when using caching; 3 cents when not cached). This basically makes intelligence free. And while it is free, it's still their multimodal model (supports images) and has a HUGE context window of 1M tokens. The evals look ridiculous as well: this 8B-param model now almost matches Flash from May of this year, less than 6 months ago.
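To put that pricing in perspective, here's a quick back-of-the-envelope sketch. The rates are assumed from the numbers quoted above ($0.01 per million cached input tokens, $0.03 uncached); real billing has separate input/output rates and caching rules, so treat this as illustrative only:

```python
# Rough input-token cost estimate for Gemini 1.5 Flash 8B
# (rates assumed from the post: 1 cent/M cached, 3 cents/M uncached).
CACHED_PER_M = 0.01
UNCACHED_PER_M = 0.03

def cost_usd(tokens, cached_fraction=0.0):
    """Estimated cost in USD for `tokens` input tokens at the assumed rates."""
    cached = tokens * cached_fraction
    uncached = tokens - cached
    return (cached * CACHED_PER_M + uncached * UNCACHED_PER_M) / 1_000_000

# A full 1M-token context, fully uncached: about 3 cents
one_context = cost_usd(1_000_000)

# One BILLION input tokens with an 80% cache-hit rate: $14
billion = cost_usd(1_000_000_000, cached_fraction=0.8)
```

Even at a billion tokens you're in lunch-money territory, which is exactly the "intelligence is essentially free" point.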
Hey, Alex here. Super quick, as I'm still attending Dev Day, but I didn't want to leave you hanging (if you're a paid subscriber!), so I decided to outsource my job and give the amazing podcasters of NotebookLM the whole transcript of the opening keynote of OpenAI Dev Day. You can see a blog of everything they just posted here. Here's a summary of all that was announced:* Developer-Centric Approach: OpenAI consistently emphasized the importance of developers in their mission to build beneficial AGI. The speaker stated, "OpenAI's mission is to build AGI that benefits all of humanity, and developers are critical to that mission... we cannot do this without you."* Reasoning as a New Frontier: The introduction of the new o1 series of models marks a significant step towards AI with advanced reasoning capabilities, going beyond the limitations of previous GPT models.* Multimodal Capabilities: OpenAI is expanding the potential of AI applications by introducing multimodal capabilities, particularly focusing on real-time speech-to-speech interaction through the new Realtime API.* Customization and Fine-Tuning: Empowering developers to customize models is a key theme. OpenAI introduced Vision for fine-tuning with images and announced easier access to fine-tuning with model distillation tools.* Accessibility and Scalability: OpenAI demonstrated a commitment to making AI more accessible and cost-effective for developers through initiatives like price reductions, prompt caching, and model distillation tools. Important Ideas and Facts: 1. The O1 Models:* Represent a shift towards AI models with enhanced reasoning capabilities, surpassing previous generations in problem-solving and logical thought processes.* O1 Preview is positioned as the most powerful reasoning model, designed for complex problems requiring extended thought processes.* O1 Mini offers a faster, cheaper, and smaller alternative, particularly suited for tasks like code debugging and agent-based applications.* Both models demonstrate advanced capabilities in coding, math, and scientific reasoning.* OpenAI highlighted the ability of O1 models to work with developers as "thought partners," understanding complex instructions and contributing to the development process. Quote: "The shift to reasoning introduces a new shape of AI capability. The ability for our model to scale and correct the process is pretty mind-blowing. So we are resetting the clock, and we are introducing a new series of models under the name O1." 2. Realtime API:* Enables developers to build real-time AI experiences directly into their applications using WebSockets.* Launches with support for speech-to-speech interaction, leveraging the technology behind ChatGPT's advanced voice models.* Offers natural and seamless integration of voice capabilities, allowing for dynamic and interactive user experiences.* Showcased the potential to revolutionize human-computer interaction across various domains like driving, education, and accessibility. Quote: "You know, a lot of you have been asking about building amazing speech-to-speech experiences right into your apps. Well now, you can." 3. Vision, Fine-Tuning, and Model Distillation:* Vision introduces the ability to use images for fine-tuning, enabling developers to enhance model performance in image understanding tasks.* Fine-tuning with Vision opens up opportunities in diverse fields such as product recommendations, medical imaging, and autonomous driving.* OpenAI emphasized the accessibility of these features, stating that "fine-tuning with Vision is available to every single developer."* Model distillation tools facilitate the creation of smaller, more efficient models by transferring knowledge from larger models like O1 and GPT-4.* This approach addresses cost concerns and makes advanced AI capabilities more accessible for a wider range of applications and developers. Quote: "With distillation, you take the outputs of a large model to supervise, to teach a smaller model. And so today, we are announcing our own model distillation tools." 4. Cost Reduction and Accessibility:* OpenAI highlighted its commitment to lowering the cost of AI models, making them more accessible for diverse use cases.* Announced a 90% decrease in cost per token since the release of GPT-3, emphasizing continuous efforts to improve affordability.* Introduced prompt caching, automatically providing a 50% discount for input tokens the model has recently processed.* These initiatives aim to remove financial barriers and encourage wider adoption of AI technologies across various industries. Quote: "Every time we reduce the price, we see new types of applications, new types of use cases emerge. We're super far from the price equilibrium. In a way, models are still too expensive to be bought at massive scale." Conclusion: OpenAI DevDay conveyed a strong message of developer empowerment and a commitment to pushing the boundaries of AI capabilities.
With new models like O1, the introduction of the Realtime API, and a dedicated focus on accessibility and customization, OpenAI is paving the way for a new wave of innovative and impactful AI applications developed by a global community. This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
Hey everyone, it's Alex (still traveling!), and oh boy, what a week again! Advanced Voice Mode is finally here from OpenAI, Google updated their Gemini models in a huge way, and then Meta announced MultiModal LlaMas and on-device mini Llamas (and we also got a "better"? multimodal from Allen AI called MOLMO!). From the Weights & Biases perspective, our hackathon was a success this weekend, and then I went down to Menlo Park for my first Meta Connect conference, full of news and updates, and I will do a full recap here as well. ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Overall another crazy week in AI, and it seems that everyone is trying to rush something out the door before OpenAI Dev Day next week (which I'll cover as well!) Get ready, folks, because Dev Day is going to be epic! TL;DR of all topics covered: * Open Source LLMs * Meta Llama 3.2 Multimodal models (11B & 90B) (X, HF, try free)* Meta Llama 3.2 tiny models 1B & 3B parameters (X, Blog, download)* Allen AI releases MOLMO - open SOTA multimodal AI models (X, Blog, HF, Try It)* Big CO LLMs + APIs* OpenAI releases Advanced Voice Mode to all & Mira Murati leaves OpenAI * Google updates Gemini 1.5-Pro-002 and 1.5-Flash-002 (Blog)* This week's Buzz * Our free course is LIVE - more than 3000 already started learning how to build advanced RAG++* Sponsoring tonight's AI Tinkerers in Seattle; if you're in Seattle, come through for my demo* Voice & Audio* Meta also launches voice mode (demo)* Tools & Others* Project ORION - holographic glasses are here! (link) Meta gives us new LLaMas and AI hardware. LLama 3.2 Multimodal 11B and 90B. This was by far the biggest open-source release of this week (tho see below, it may not be the "best"), as a rumored release finally came out, and Meta has given our Llama eyes!
Coming in 2 versions (well, 4 if you count the base models, which they also released), these new MultiModal LLaMas were trained with an adapter architecture, keeping the underlying text models the same and placing a vision encoder, trained and finetuned separately, on top. LLama 90B is among the best open-source multimodal models available— Meta team at launch. These new vision adapters were trained on a massive 6 billion images, including synthetic data generated by the 405B model for questions/captions, and finetuned with a subset of 600M high-quality image pairs. Unlike the rest of their models, the Meta team did NOT claim SOTA on these models, and the benchmarks are very good but not the best we've seen (Qwen 2 VL from a couple of weeks ago, and MOLMO from today, beat it on several benchmarks). With text-only inputs, the Llama 3.2 Vision models are functionally the same as the Llama 3.1 Text models; this allows the Llama 3.2 Vision models to be a drop-in replacement for Llama 3.1 8B/70B with added image understanding capabilities. Seems like these models don't support multi-image or video inputs (unlike Pixtral, for example), nor tool use with images. Meta will also release these models on meta.ai and every other platform, and they cited a crazy 500 million monthly active users of their AI services across all their apps 🤯 which marks them as the leading AI services provider in the world now. Llama 3.2 Lightweight Models (1B/3B). The additional, and maybe more exciting, thing that we got from Meta was the introduction of the small/lightweight models of 1B and 3B parameters. Trained on up to 9T tokens, and distilled/pruned from larger models, these are aimed at on-device inference (and by device here we mean everything from laptops to mobiles to soon... glasses?
more on this later) In fact, Meta released an iOS demo that runs these models, takes a group chat, summarizes it, and calls the calendar tool to schedule based on the conversation, and all this happens on device, without the info leaving to a larger model. They have also been able to prune down the LLama-guard safety model they released to under 500MB and have had demos of it running client-side and hiding user input on the fly as the user types something bad! Interestingly, here too, the models were not SOTA, even in the small category, with tiny models like Qwen 2.5 3B beating these models on many benchmarks, but they are outlining a new distillation/pruning era for Meta as they aim for these models to run on device, eventually even glasses (and some said Smart Thermostats). In fact, they are so tiny that the community quantized and released them, and I was able to download these models all while the keynote was still going! Here I am running the Llama 3B during the developer keynote! Speaking AI - not only from OpenAI. Zuck also showcased a voice-based Llama that's coming to Meta AI (unlike OpenAI's, it's likely a pipeline of TTS/STT), but it worked really fast and Zuck was able to interrupt it. And they also showed a crazy animated AI avatar of a creator, fully backed by Llama; while the human creator was on stage, Zuck chatted with his avatar, and reaction times were really, really impressive. AI Hardware was glasses all along? Look, we've all seen the blunders of this year, the Humane AI Pin, the Rabbit R1 (which sits on my desk and I haven't recharged in two months), but maybe Meta is the answer here? Zuck made a bold claim that glasses are actually the perfect form factor for AI: it sits on your face, sees what you see and hears what you hear, and can whisper in your ear without disrupting the connection between you and your conversation partner.
They haven't announced new Meta Raybans, but did update the lineup with a new set of transition lenses (to be able to wear those glasses inside and out) and a special edition clear pair that looks very sleek, plus new AI features like memories, to be able to ask the glasses "hey Meta, where did I park" or be able to continue the conversation. I had to get myself a pair of these limited edition ones! Project ORION - first holographic glasses. And of course, the biggest announcement of Meta Connect was the super-secret, decade-old project of fully holographic AR glasses, which they called ORION. Zuck introduced these as the most innovative and technologically dense set of glasses in the world. They always said the form factor will become just "glasses" and they actually did it (a week after Snap Spectacles), tho those are not going to be released to anyone anytime soon; hell, they only made a few thousand of these and they are extremely expensive. With a 70-degree FOV, cameras, speakers and a compute puck, these glasses pack a full-day battery at under 100 grams of weight, and have custom silicon, custom displays with a MicroLED projector and just... tons more innovation in there. They also come in 3 pieces: the glasses themselves, the wireless compute pack that will hold the LLaMas in your pocket, and the EMG wristband that allows you to control these devices using muscle signals. These won't ship as a product tho, so don't expect to get them soon, but they are real, and will allow Meta to build the product that we will get on top of these by 2030. AI use cases: So what will these glasses be able to do? Well, they showed off a live translation feature on stage that mostly worked, where you just talk and listen to another language in near real time, which was great. There are a bunch of mixed reality games, you'd be able to call people and see them in your glasses on a virtual screen, and soon you'll show up as an avatar there as well.
The AI use-case they showed beyond just translation was MultiModality stuff, where they had a bunch of ingredients for a shake, and you could ask your AI assistant which shake you can make with what it sees. Do you really need more than that? I'm so excited about these finally coming to people (I screamed in the audience) 👀👓 OpenAI gives everyone* Advanced Voice Mode. It's finally here, and if you're paying for chatGPT you know this: the long-announced Advanced Voice Mode for chatGPT is now rolled out to all Plus members. The new updates since the beta are 5 new voices (Maple, Spruce, Vale, Arbor and Sol), and finally access to custom instructions and memory, so you can ask it to remember things and also to know who you are and your preferences (try saving your jailbreaks there). Unfortunately, as predicted, by the time it rolled out to everyone, this feels way less exciting than it did 6 months ago: the model is way less emotional, refuses to sing (tho folks are making it anyway) and generally feels way less "wow" than what we saw. Less "HER" than we wanted, for sure. Seriously, they nerfed the singing! Why, OpenAI, why? Pro tip of mine that went viral: you can set the action button on the newer iPhones to immediately start the voice conversation with 1 click. *This new mode is not available in the EU. This week's Buzz - our new advanced RAG++ course is live. I had an awesome time with my colleagues Ayush and Bharat today, after they finally released a FREE advanced RAG course they've been working so hard on for the past few months! Definitely check out our conversation, but better yet, why don't you roll into the course? It's FREE and you'll get to learn about data ingestion, evaluation, query enhancement and more! New Gemini 002 is 50% cheaper, 2x faster and better at MMLU-Pro. It seems that every major lab (besides Anthropic) released a big thing this week to try and get under Meta's skin?
Google announced an update to their Gemini Pro/Flash models, called 002, which is a very significant update!Not only are these models 50% cheaper now (Pro price went down by 50% on <128K context lengths), they are 2x faster on outputs with 3x lower latency on first tokens. It's really quite something to seeThe new models have also improved scores, with the Flash models (the super cheap ones, remember) from September, now coming close to or beating the Pro scores from May 2024! Definitely a worthy update from the team at Google! Hot off the press, the folks at Google Labs also added a
Hey folks, Alex here, back with another ThursdAI recap – and let me tell you, this week's episode was a whirlwind of open-source goodness, mind-bending inference techniques, and a whole lotta talk about talking AIs! We dove deep into the world of LLMs, from Alibaba's massive Qwen 2.5 drop to the quirky, real-time reactions of Moshi. We even got a sneak peek at Nous Research's ambitious new project, Forge, which promises to unlock some serious LLM potential. So grab your pumpkin spice latte (it's that time again isn't it? 🍁) settle in, and let's recap the AI awesomeness that went down on ThursdAI, September 19th! ThursdAI is brought to you (as always) by Weights & Biases, we still have a few spots left in our Hackathon this weekend and our new advanced RAG course is now released and is FREE to sign up!TL;DR of all topics + show notes and links* Open Source LLMs * Alibaba Qwen 2.5 models drop + Qwen 2.5 Math and Qwen 2.5 Code (X, HF, Blog, Try It)* Qwen 2.5 Coder 1.5B is running on a 4 year old phone (Nisten)* KyutAI open sources Moshi & Mimi (Moshiko & Moshika) - end to end voice chat model (X, HF, Paper)* Microsoft releases GRIN-MoE - tiny (6.6B active) MoE with 79.4 MMLU (X, HF, GIthub)* Nvidia - announces NVLM 1.0 - frontier class multimodal LLMS (no weights yet, X)* Big CO LLMs + APIs* OpenAI O1 results from LMsys do NOT disappoint - vibe checks also confirm, new KING llm in town (Thread)* NousResearch announces Forge in waitlist - their MCTS enabled inference product (X)* This weeks Buzz - everything Weights & Biases related this week* Judgement Day (hackathon) is in 2 days! 
Still places to come hack with us Sign up* Our new RAG Course is live - learn all about advanced RAG from WandB, Cohere and Weaviate (sign up for free)* Vision & Video* Youtube announces DreamScreen - generative AI image and video in youtube shorts ( Blog)* CogVideoX-5B-I2V - leading open source img2video model (X, HF)* Runway, DreamMachine & Kling all announce text-2-video over API (Runway, DreamMachine)* Runway announces video 2 video model (X)* Tools* Snap announces their XR glasses - have hand tracking and AI features (X)Open Source Explosion!👑 Qwen 2.5: new king of OSS llm models with 12 model releases, including instruct, math and coder versionsThis week's open-source highlight was undoubtedly the release of Alibaba's Qwen 2.5 models. We had Justin Lin from the Qwen team join us live to break down this monster drop, which includes a whopping seven different sizes, ranging from a nimble 0.5B parameter model all the way up to a colossal 72B beast! And as if that wasn't enough, they also dropped Qwen 2.5 Coder and Qwen 2.5 Math models, further specializing their LLM arsenal. As Justin mentioned, they heard the community's calls for 14B and 32B models loud and clear – and they delivered! "We do not have enough GPUs to train the models," Justin admitted, "but there are a lot of voices in the community...so we endeavor for it and bring them to you." Talk about listening to your users!Trained on an astronomical 18 trillion tokens (that’s even more than Llama 3.1 at 15T!), Qwen 2.5 shows significant improvements across the board, especially in coding and math. They even open-sourced the previously closed-weight Qwen 2 VL 72B, giving us access to the best open-source vision language models out there. With a 128K context window, these models are ready to tackle some serious tasks. 
As Nisten exclaimed after putting the 32B model through its paces, "It's really practical…I was dumping in my docs and my code base and then like actually asking questions."It's safe to say that Qwen 2.5 coder is now the best coding LLM that you can use, and just in time for our chat, a new update from ZeroEval confirms, Qwen 2.5 models are the absolute kings of OSS LLMS, beating Mistral large, 4o-mini, Gemini Flash and other huge models with just 72B parameters 👏 Moshi: The Chatty Cathy of AIWe've covered Moshi Voice back in July, and they have promised to open source the whole stack, and now finally they did! Including the LLM and the Mimi Audio Encoder! This quirky little 7.6B parameter model is a speech-to-speech marvel, capable of understanding your voice and responding in kind. It's an end-to-end model, meaning it handles the entire speech-to-speech process internally, without relying on separate speech-to-text and text-to-speech models.While it might not be a logic genius, Moshi's real-time reactions are undeniably uncanny. Wolfram Ravenwolf described the experience: "It's uncanny when you don't even realize you finished speaking and it already starts to answer." The speed comes from the integrated architecture and efficient codecs, boasting a theoretical response time of just 160 milliseconds!Moshi uses (also open sourced) Mimi neural audio codec, and achieves 12.5 Hz representation with just 1.1 kbps bandwidth.You can download it and run on your own machine or give it a try here just don't expect a masterful conversationalist heheGradient-Informed MoE (GRIN-MoE): A Tiny TitanJust before our live show, Microsoft dropped a paper on GrinMoE, a gradient-informed Mixture of Experts model. We were lucky enough to have the lead author, Liyuan Liu (aka Lucas), join us impromptu to discuss this exciting development. 
Despite having only 6.6B active parameters (16 x 3.8B experts), GrinMoE manages to achieve remarkable performance, even outperforming larger models like Phi-3 on certain benchmarks. It's a testament to the power of clever architecture and training techniques. Plus, it's open-sourced under the MIT license, making it a valuable resource for the community.NVIDIA NVLM: A Teaser for NowNVIDIA announced NVLM 1.0, their own set of multimodal LLMs, but alas, no weights were released. We’ll have to wait and see how they stack up against the competition once they finally let us get our hands on them. Interestingly, while claiming SOTA on some vision tasks, they haven't actually compared themselves to Qwen 2 VL, which we know is really really good at vision tasks 🤔 Nous Research Unveils Forge: Inference Time Compute Powerhouse (beating o1 at AIME Eval!)Fresh off their NousCon event, Karan and Shannon from Nous Research joined us to discuss their latest project, Forge. Described by Shannon as "Jarvis on the front end," Forge is an inference engine designed to push the limits of what’s possible with existing LLMs. Their secret weapon? Inference-time compute. By implementing sophisticated techniques like Monte Carlo Tree Search (MCTS), Forge can outperform larger models on complex reasoning tasks beating OpenAI's o1-preview at the AIME Eval, competition math benchmark, even with smaller, locally runnable models like Hermes 70B. As Karan emphasized, “We’re actually just scoring with Hermes 3.1, which is available to everyone already...we can scale it up to outperform everything on math, just using a system like this.”Forge isn't just about raw performance, though. It's built with usability and transparency in mind. Unlike OpenAI's 01, which obfuscates its chain of thought reasoning, Forge provides users with a clear visual representation of the model's thought process. 
"You will still have access in the sidebar to the full chain of thought," Shannon explained, adding, “There’s a little visualizer and it will show you the trajectory through the tree… you’ll be able to see exactly what the model was doing and why the node was selected.” Forge also boasts built-in memory, a graph database, and even code interpreter capabilities, initially supporting Python, making it a powerful platform for building complex LLM applications.Forge is currently in a closed beta, but a waitlist is open for eager users. Karan and Shannon are taking a cautious approach to the rollout, as this is Nous Research’s first foray into hosting a product. For those lucky enough to gain access, Forge offers a tantalizing glimpse into the future of LLM interaction, promising greater transparency, improved reasoning, and more control over the model's behavior.For ThursdAI readers early, here's a waitlist form to test it out!Big Companies and APIs: The Reasoning RevolutionOpenAI’s 01: A New Era of LLM ReasoningThe big story in the Big Tech world is OpenAI's 01. Since we covered it live last week as it dropped, many of us have been playing with these new reasoning models, and collecting "vibes" from the community. These models represent a major leap in reasoning capabilities, and the results speak for themselves. 01 Preview claimed the top spot across the board on the LMSys Arena leaderboard, demonstrating significant improvements in complex tasks like competition math and coding. Even the smaller 01 Mini showed impressive performance, outshining larger models in certain technical areas. (and the jump in ELO score above the rest in MATH is just incredible to see!) and some folks made this video viral, of a PHD candidate reacting to 01 writing in 1 shot, code that took him a year to write, check it out, it’s priceless. One key aspect of 01 is the concept of “inference-time compute”. 
As Noam Brown from OpenAI calls it, this represents a "new scaling paradigm", allowing the model to spend more time “thinking” during inference, leading to significantly improved performance on reasoning tasks. The implications of this are vast, opening up the possibility of LLMs tackling long-horizon problems in areas like drug discovery and physics.However, the opacity surrounding 01’s chain of thought reasoning being hidden/obfuscated and the ban on users asking about it was a major point of contention at least within the ThursdAI chat. As Wolfram Ravenwolf put it, "The AI gives you an answer and you can't even ask how it got there. That is the wrong direction." as he was referring to the fact that not only is asking about the reasoning impossible, some folks were actually getting threatening emails and getting banned from using the product all together 😮This Week's Buzz: Hackathons and RAG Courses!We're almost ready to
March 14th, 2023 was the day ThursdAI was born, it was also the day OpenAI released GPT-4, and I jumped into a Twitter space and started chaotically reacting together with other folks about what a new release of a paradigm shifting model from OpenAI means, what are the details, the new capabilities. Today, it happened again! Hey, it's Alex, I'm back from my mini vacation (pic after the signature) and boy am I glad I decided to not miss September 12th! The long rumored 🍓 thinking model from OpenAI, dropped as breaking news in the middle of ThursdAI live show, giving us plenty of time to react live! But before this, we already had an amazing show with some great guests! Devendra Chaplot from Mistral came on and talked about their newly torrented (yeah they did that again) Pixtral VLM, their first multi modal! , and then I had the honor to host Steven Johnson and Raiza Martin from NotebookLM team at Google Labs which shipped something so uncannily good, that I legit said "holy fu*k" on X in a reaction! So let's get into it (TL;DR and links will be at the end of this newsletter)OpenAI o1, o1 preview and o1-mini, a series of new "reasoning" modelsThis is it folks, the strawberries have bloomed, and we finally get to taste them. OpenAI has released (without a waitlist, 100% rollout!) o1-preview and o1-mini models to chatGPT and API (tho only for tier-5 customers) 👏 and are working on releasing 01 as well.These are models that think before they speak, and have been trained to imitate "system 2" thinking, and integrate chain-of-thought reasoning internally, using Reinforcement Learning and special thinking tokens, which allows them to actually review what they are about to say before they are saying it, achieving remarkable results on logic based questions.Specifically you can see the jumps in the very very hard things like competition math and competition code, because those usually require a lot of reasoning, which is what these models were trained to do well. 
New scaling paradigm Noam Brown from OpenAI calls this a "new scaling paradigm" and Dr Jim Fan explains why, with this new way of "reasoning", the longer the model thinks - the better it does on reasoning tasks, they call this "test-time compute" or "inference-time compute" as opposed to compute that was used to train the model. This shifting of computation down to inference time is the essence of the paradigm shift, as in, pre-training can be very limiting computationally as the models scale in size of parameters, they can only go so big until you have to start building out a huge new supercluster of GPUs to host the next training run (Remember Elon's Colossus from last week?). The interesting thing to consider here is, while current "thinking" times are ranging between a few seconds to a minute, imagine giving this model hours, days, weeks to think about new drug problems, physics problems 🤯.Prompting o1 Interestingly, a new prompting paradigm has also been introduced. These models now have CoT (think "step by step") built-in, so you no longer have to include it in your prompts. By simply switching to o1-mini, most users will see better results right off the bat. OpenAI has worked with the Devin team to test drive these models, and these folks found that asking the new models to just give the final answer often works better and avoids redundancy in instructions.The community of course will learn what works and doesn't in the next few hours, days, weeks, which is why we got 01-preview and not the actual (much better) o1. Safety implications and future plansAccording to Greg Brokman, this inference time compute also greatly helps with aligning the model to policies, giving it time to think about policies at length, and improving security and jailbreak preventions, not only logic. 
The folks at OpenAI are so proud of all of the above that they have decided to restart the count and call this series o1, but they did mention that they are going to release GPT series models as well, adding to the confusing marketing around their models. Open Source LLMs Reflecting on Reflection 70BLast week, Reflection 70B was supposed to launch live on the ThursdAI show, and while it didn't happen live, I did add it in post editing, and sent the newsletter, and packed my bag, and flew for my vacation. I got many DMs since then, and at some point couldn't resist checking and what I saw was complete chaos, and despite this, I tried to disconnect still until last night. So here's what I could gather since last night. The claims of a llama 3.1 70B finetune that Matt Shumer and Sahil Chaudhary from Glaive beating Sonnet 3.5 are proven false, nobody was able to reproduce those evals they posted and boasted about, which is a damn shame. Not only that, multiple trusted folks from our community, like Kyle Corbitt, Alex Atallah have reached out to Matt in to try to and get to the bottom of how such a thing would happen, and how claims like these could have been made in good faith. (or was there foul play) The core idea of something like Reflection is actually very interesting, but alas, the inability to replicate, but also to stop engaging with he community openly (I've reached out to Matt and given him the opportunity to come to the show and address the topic, he did not reply), keep the model on hugging face where it's still trending, claiming to be the world's number 1 open source model, all these smell really bad, despite multiple efforts on out part to give the benefit of the doubt here. 
As for my part in building the hype on this (last week's issues till claims that this model is top open source model), I addressed it in the beginning of the show, but then twitter spaces crashed, but unfortunately as much as I'd like to be able to personally check every thing I cover, I often have to rely on the reputation of my sources, which is easier with established big companies, and this time this approached failed me. This weeks Buzzzzzz - One last week till our hackathon! Look at this point, if you read this newsletter and don't know about our hackathon, then I really didn't do my job prompting it, but it's coming up, September 21-22 ! Join us, it's going to be a LOT of fun! 🖼️ Pixtral 12B from Mistral Mistral AI burst onto the scene with Pixtral, their first multimodal model! Devendra Chaplot, research scientist at Mistral, joined ThursdAI to explain their unique approach, ditching fixed image resolutions and training a vision encoder from scratch."We designed this from the ground up to...get the most value per flop," Devendra explained. Pixtral handles multiple images interleaved with text within a 128k context window - a far cry from the single-image capabilities of most open-source multimodal models. And to make the community erupt in thunderous applause (cue the clap emojis!) they released the 12 billion parameter model under the ultra-permissive Apache 2.0 license. You can give Pixtral a whirl on Hyperbolic, HuggingFace, or directly through Mistral.DeepSeek 2.5: When Intelligence Per Watt is KingDeepseek 2.5 launched amid the reflection news and did NOT get the deserved attention it.... deserves. It folded (no deprecated) Deepseek Coder into 2.5 and shows incredible metrics and a truly next-gen architecture. "It's like a higher order MOE", Nisten revealed, "which has this whole like pile of brain and it just like picks every time, from that." 🤯. 
DeepSeek 2.5 achieves maximum "intelligence per active parameter"Google's turning text into AI podcast for auditory learners with Audio OverviewsToday I had the awesome pleasure of chatting with Steven Johnson and Raiza Martin from the NotebookLM team at Google Labs. NotebookLM is a research tool, that if you haven't used, you should definitely give it a spin, and this week they launched something I saw in preview and was looking forward to checking out and honestly was jaw-droppingly impressed today. NotebookLM allows you to upload up to 50 "sources" which can be PDFs, web links that they will scrape for you, documents etc' (no multimodality so far) and will allow you to chat with them, create study guides, dive deeper and add notes as you study. This week's update allows someone who doesn't like reading, to turn all those sources into a legit 5-10 minute podcast, and that sounds so realistic, that I was honestly blown away. I uploaded a documentation of fastHTML in there.. and well hear for yourself The conversation with Steven and Raiza was really fun, podcast definitely give it a listen! Not to mention that Google released (under waitlist) another podcast creating tool called illuminate, that will convert ArXiv papers into similar sounding very realistic 6-10 minute podcasts! ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.There are many more updates from this week, there was a whole Apple keynote I missed, which had a new point and describe feature with AI on the new iPhones and Apple Intelligence, Google also released new DataGemma 27B, and more things in TL'DR which are posted here in raw format See you next week 🫡 Thank you for being a subscriber, weeks like this are the reason we keep doing this! 
🔥 Hope you enjoy these models, leave in comments what you think about themTL;DR in raw format* Open Source LLMs * Reflect on Reflection 70B & Matt Shumer (X, Sahil)* Mixtral releases Pixtral 12B - multimodal model (X, try it)* Pixtral is really good at OCR says swyx* Interview with Devendra Chaplot on ThursdAI* Initial reports of Pixtral beating GPT-4 on WildVision arena from AllenAI* JinaIA reader-lm-0.5b and reader-lm-1.5b (X)* ZeroEval updates* Deepseek 2.5 - * Deepseek coder is now folded into DeepSeek v2.5* 89 HumanEval (up from 84 from deepseek v2)* 9 on MT-bench* Google - DataGemma 27B (RIG/RAG) for improving results * Retrieval-Interleaved Generation * 🤖 DataGemma: AI models that connect LLMs to Google'
Welcome back everyone, can you believe it's another ThursdAI already? And can you believe me when I tell you that friends of the pod Matt Shumer & Sahil form Glaive.ai just dropped a LLama 3.1 70B finetune that you can download that will outperform Claude Sonnet 3.5 while running locally on your machine? Today was a VERY heavy Open Source focused show, we had a great chat w/ Niklas, the leading author of OLMoE, a new and 100% open source MoE from Allen AI, a chat with Eugene (pico_creator) about RWKV being deployed to over 1.5 billion devices with Windows updates and a lot more. In the realm of the big companies, Elon shook the world of AI by turning on the biggest training cluster called Colossus (100K H100 GPUs) which was scaled in 122 days 😮 and Anthropic announced that they have 500K context window Claude that's only reserved if you're an enterprise customer, while OpenAI is floating an idea of a $2000/mo subscription for Orion, their next version of a 100x better chatGPT?! TL;DR* Open Source LLMs * Matt Shumer / Glaive - Reflection-LLama 70B beats Claude 3.5 (X, HF)* Allen AI - OLMoE - first "good" MoE 100% OpenSource (X, Blog, Paper, WandB)* RWKV.cpp is deployed with Windows to 1.5 Billion devices* MMMU pro - more robust multi disipline multimodal understanding bench (proj)* 01AI - Yi-Coder 1.5B and 9B (X, Blog, HF)* Big CO LLMs + APIs* Replit launches Agent in beta - from coding to production (X, Try It)* Ilya SSI announces 1B round from everyone (Post)* Cohere updates Command-R and Command R+ on API (Blog)* Claude Enterprise with 500K context window (Blog)* Claude invisibly adds instructions (even via the API?) (X)* Google got structured output finally (Docs)* Amazon to include Claude in Alexa starting this October (Blog)* X ai scaled Colossus to 100K H100 GPU goes online (X)* DeepMind - AlphaProteo new paper (Blog, Paper, Video)* This weeks Buzz* Hackathon did we mention? 
We're going to have Eugene and Greg as Judges!* AI Art & Diffusion & 3D* ByteDance - LoopyAvatar - Audio Driven portait avatars (Page)Open Source LLMsReflection Llama-3.1 70B - new 👑 open source LLM from Matt Shumer / GlaiveAI This model is BANANAs folks, this is a LLama 70b finetune, that was trained with a new way that Matt came up with, that bakes CoT and Reflection into the model via Finetune, which results in model outputting its thinking as though you'd prompt it in a certain way. This causes the model to say something, and then check itself, and then reflect on the check and then finally give you a much better answer. Now you may be thinking, we could do this before, RefleXion (arxiv.org/2303.11366) came out a year ago, so what's new? What's new is, this is now happening inside the models head, you don't have to reprompt, you don't even have to know about these techniques! So what you see above, is just colored differently, but all of it, is output by the model without extra prompting by the user or extra tricks in system prompt. the model thinks, plans, does chain of thought, then reviews and reflects, and then gives an answer! And the results are quite incredible for a 70B model 👇Looking at these evals, this is a 70B model that beats GPT-4o, Claude 3.5 on Instruction Following (IFEval), MATH, GSM8K with 99.2% 😮 and gets very close to Claude on GPQA and HumanEval! (Note that these comparisons are a bit of a apples to ... different types of apples. If you apply CoT and reflection to the Claude 3.5 model, they may in fact perform better on the above, as this won't be counted 0-shot anymore. But given that this new model is effectively spitting out those reflection tokens, I'm ok with this comparison)This is just the 70B, next week the folks are planning to drop the 405B finetune with the technical report, so stay tuned for that! Kudos on this work, go give Matt Shumer and Glaive AI a follow! 
Allen AI OLMoE - tiny "good" MoE that's 100% open source, weights, code, logsWe've previously covered OLMO from Allen Institute, and back then it was obvious how much commitment they have to open source, and this week they continued on this path with the release of OLMoE, an Mixture of Experts 7B parameter model (1B active parameters), trained from scratch on 5T tokens, which was completely open sourced. This model punches above its weights on the best performance/cost ratio chart for MoEs and definitely highest on the charts of releasing everything. By everything here, we mean... everything, not only the final weights file; they released 255 checkpoints (every 5000 steps), the training code (Github) and even (and maybe the best part) the Weights & Biases logs! It was a pleasure to host the leading author of the OLMoE paper, Niklas Muennighoff on the show today, so definitely give this segment a listen, he's a great guest and I learned a lot! Big Companies LLMs + APIAnthropic has 500K context window Claude but only for Enterprise? Well, this sucks (unless you work for Midjourney, Airtable or Deloitte). Apparently Anthropic has been sitting on Claude that can extend to half a million tokens in the context window, and decided to keep it to themselves and a few trial enterprises, and package it as an Enterprise offering. This offering now includes, beyond just the context window, also a native Github integration, and a few key enterprise features like access logs, provisioning and SCIM and all kinds of "procurement and CISO required" stuff enterprises look for. 
To be clear, this is a great move for Anthropic, and this isn't an API tier, this is for their front end offering, including the indredible artifacts tool, so that companies can buy their employees access to Claude.ai and have them be way more productive coding (hence the Github integration) or summarizing (very very) long documents, building mockups and one off apps etc' Anthropic is also in the news this week, because Amazon announced that it'll use Claude as the backbone for the smart (or "remarkable" as they call it) Alexa brains coming up in October, which, again, incredible for Anthropic distribution, as there are maybe 100M Alexa users in the world or so. Prompt injecting must stop! And lastly, there have been mounting evidence, including our own Wolfram Ravenwolf that confirmed it, that Anthropic is prompt injecting additional context into your own prompts, in the UI but also via the API! This is awful practice and if anyone from there reads this newsletter, please stop or at least acknowledge. Claude apparently just... thinks that it's something my users said, when in fact, it's some middle layer of anthropic security decided to just inject some additional words in there!XAI turns on the largest training GPU SuperCluster Colossus - 100K H100 GPUSThis is a huge deal for AI, specifically due to the time this took and the massive massive scale of this SuperCluster. SuperCluster means all these GPUs sit in one datacenter, drawing from the same power-grid and can effectively run single training jobs. This took just 122 days for Elon and the XAI team to go from an empty warehouse in Memphis to booting up an incredible 100K H100, and they claim that they will double this capacity by adding 50K H200 in the next few months. As Elon mentioned when they released Grok2, it was trained on 15K, and it matched GPT4! 
Per SemiAnalisys, this new Supercluster can train a GPT-4 level model in just 4 days 🤯 XAI was founded a year ago, and by end of this year, they plan for Grok to be the beast LLM in the world, and not just get to GPT-4ish levels, and with this + 6B investment they have taken in early this year, it seems like they are well on track, which makes some folks at OpenAI reportedly worriedThis weeks buzz - we're in SF in less than two weeks, join our hackathon! This time I'm very pleased to announce incredible judges for our hackathon, the spaces are limited, but there's still some spaces so please feel free to sign up and join usI'm so honored to announce that we'll have Eugene Yan (@eugeneyan), Greg Kamradt (@GregKamradt) and Charles Frye (@charles_irl) on the Judges panel. 🤩 It'll be incredible to have these folks see what hackers come up with, and I'm excited as this comes closer! Replit launches Agents beta - a fully integrated code → deployment agent Replit is a great integrated editing environment, with database and production in 1 click and they've had their LLMs trained on a LOT of code helping folks code for a while. Now they are launching agents, which seems very smart from them, given that development is much more than just coding. All the recent excitement we see about Cursor, is omitting the fact that those demos are only working for folks who already know how to set up the environment, and then there's the need to deploy to production, maintain.Replit has that basically built in, and now their Agent can build a plan and help you build those apps, and "ship" them, while showing you what they are doing. This is massive, and I can't wait to play around with this! The additional benefit of Replit is that they nailed the mobile app experience as well, so this now works from mobile, on the go! In fact, as I was writing this, I got so excited that I paused for 30 minutes, payed the yearly subscription and decided to give building an app a try! 
The fact that this can deploy and run the server and the frontend, detect errors, fix them, and then also provision a DB for me, provision Stripe, login buttons and everything else is quite insane. Can't wait to see what I can spin up with this 🔥 (and show all of you!) Loopy - Animated Avatars from ByteDance A new animated avatar project from folks at ByteDance just dropped, and it’s WAY clearer than anything we’ve seen before, like EMO or anything else. I will just add this video here for you to enjoy and look at the earring movements, vocal cords, eyes, everything! I of course wanted to know if I’ll ever be able to use this, and .. likely no, here’s the response I got from Jianwen one of the Authors today. That's it for this week, we've
Hey, for the least time during summer of 2024, welcome to yet another edition of ThursdAI, also happy skynet self-awareness day for those who keep track :) This week, Cerebras broke the world record for fastest LLama 3.1 70B/8B inference (and came on the show to talk about it) Google updated 3 new Geminis, Anthropic artifacts for all, 100M context windows are possible, and Qwen beats SOTA on vision models + much more! As always, this weeks newsletter is brought to you by Weights & Biases, did I mention we're doing a hackathon in SF in September 21/22 and that we have an upcoming free RAG course w/ Cohere & Weaviate? TL;DR* Open Source LLMs * Nous DisTrO - Distributed Training (X , Report)* NousResearch/ hermes-function-calling-v1 open sourced - (X, HF)* LinkedIN Liger-Kernel - OneLine to make Training 20% faster & 60% more memory Efficient (Github)* Cartesia - Rene 1.3B LLM SSM + Edge Apache 2 acceleration (X, Blog)* Big CO LLMs + APIs* Cerebras launches the fastest AI inference - 447t/s LLama 3.1 70B (X, Blog, Try It)* Google - Gemini 1.5 Flash 8B & new Gemini 1.5 Pro/Flash (X, Try it)* Google adds Gems & Imagen to Gemini paid tier* Anthropic artifacts available to all users + on mobile (Blog, Try it)* Anthropic publishes their system prompts with model releases (release notes)* OpenAI has project Strawberry coming this fall (via The information)* This weeks Buzz* WandB Hackathon hackathon hackathon (Register, Join)* Also, we have a new RAG course w/ Cohere and Weaviate (RAG Course)* Vision & Video* Zhipu AI CogVideoX - 5B Video Model w/ Less 10GB of VRAM (X, HF, Try it)* Qwen-2 VL 72B,7B,2B - new SOTA vision models from QWEN (X, Blog, HF)* AI Art & Diffusion & 3D* GameNgen - completely generated (not rendered) DOOM with SD1.4 (project)* FAL new LORA trainer for FLUX - trains under 5 minutes (Trainer, Coupon for ThursdAI)* Tools & Others* SimpleBench from AI Explained - closely matches human experience (simple-bench.com)ThursdAI - Recaps of the most high signal AI 
weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Open Source

Let's be honest - ThursdAI is a love letter to the open-source AI community, and this week was packed with reasons to celebrate.

Nous Research DisTrO + Function Calling V1

Nous Research was on fire this week (aren't they always?) and they kicked off the week with the release of DisTrO, which is a breakthrough in distributed training. You see, while LLM training requires a lot of hardware, it also requires a lot of network bandwidth between the different GPUs, even within the same data center. Proprietary networking solutions like Nvidia NVLink, and more open standards like Ethernet, work well within the same datacenter, but training across different GPU clouds has been unimaginable until now.

Enter DisTrO, a new decentralized training approach from the mad geniuses at Nous Research, in which they reduced the required bandwidth to train a 1.2B param model from 74.4GB to just 86MB (857x)! This can have massive implications for training across compute clusters, doing shared training runs, optimizing costs and efficiency and democratizing LLM training access! So don't sell your old GPUs just yet, someone may just come up with a folding@home but for training the largest open source LLM, and it may just be Nous!

Nous Research also released their function-calling-v1 dataset (HF) that was used to train Hermes-2, and we had InterstellarNinja, who authored that dataset, join the show and chat about it. This is an incredible unlock for the open source community, as function calling becomes a de-facto standard now. Shout out to the Glaive team as well for their pioneering work that paved the way!

LinkedIn's Liger Kernel: Unleashing the Need for Speed (with One Line of Code)

What if I told you that, when training LLMs, you could add 1 line of code and it'll run 20% faster and require 60% less memory?
This is basically what LinkedIn researchers released this week with Liger Kernel. Yes, you read that right, LinkedIn, as in the website you post career updates on!

"If you're doing any form of finetuning, using this is an instant win" - Wing Lian, Axolotl

This absolutely bonkers improvement in training LLMs now works smoothly with Flash Attention, PyTorch FSDP and DeepSpeed. If you want to read more about the implementation of the triton kernels, you can see a deep dive here. I just wanted to bring this to your attention, even if you're not technical, because efficiency jumps like these are happening all the time. We are used to seeing them in capabilities / intelligence, but they are also happening on the algorithmic/training/hardware side, and it's incredible to see!

Huge shoutout to Byron and team at LinkedIn for this unlock, check out their Github if you want to get involved!

Qwen-2 VL - SOTA image and video understanding + open weights mini VLM

You may already know that we love the folks at Qwen here on ThursdAI, not only because Junyang Lin is a frequent co-host and we get to hear about their releases as soon as they come out (they seem to be releasing them on Thursdays around the time of the live show, I wonder why!), but also because they are committed to open source, and have released 2 models, 7B and 2B, with a complete Apache 2 license!

First of all, their Qwen-2 VL 72B model is now SOTA at many benchmarks, beating GPT-4, Claude 3.5 and other much bigger models. This is insane. I literally had to pause Junyang and repeat what he said, this is a 72B param model, that beats GPT-4o on document understanding, on math, on general visual Q&A.

Additional Capabilities & Smaller models

They have added new capabilities in these models, like being able to handle arbitrary resolutions, but the one I'm most excited about is the video understanding.
These models can now understand up to 20 minutes of video sequences, and it's not just "split the video into 10 frames and caption each image", no, these models understand video progression, and if I understand correctly how they do it, it's quite genius. They embed time progression into the model using a new technique called M-RoPE, which turns the time progression into rotary positional embeddings.

Now, the 72B model is currently available via API, but we do get 2 new small models with Apache 2 license and they are NOT too shabby either! 7B parameters (HF) and 2B Qwen-2 VL (HF) are small enough to run completely on your machine, and the 2B parameter one scores better than GPT-4o mini on OCR-bench for example!

I can't wait to finish writing and go play with these models!

Big Companies & LLM APIs

The biggest news this week came from Cerebras Systems, a relatively unknown company that shattered the world record for LLM inference speed out of the blue (and came on the show to talk about how they are doing it).

Cerebras - fastest LLM inference on wafer scale chips

Cerebras has introduced the concept of wafer scale chips to the world. If you imagine a typical microchip, it's maybe the size of a postage stamp; GPUs are bigger; well, Cerebras is making chips the size of an iPad (72 square inches), the largest commercial chips in the world. And now, they created an inference stack on top of those chips, and showed that they have the fastest inference in the world. How fast? Well, they can serve LLama 3.1 8B at a whopping 1822t/s. No really, these are INSANE speeds. As I was writing this, I copied all the words I had so far, went to inference.cerebras.ai, asked to summarize, pasted and hit send, and I immediately got a summary!

"The really simple explanation is we basically store the entire model, whether it's 8B or 70B or 405B, entirely on the chip. There's no external memory, no HBM.
We have 44 gigabytes of memory on chip." - James Wang

They not only store the whole model (405B coming soon), but they store it in full fp16 precision as well, so they don't quantize the models. Right now, they are serving it with an 8K token context window, and we had a conversation about their next steps being giving more context to developers.

The whole conversation is well worth listening to, James and Ian were awesome to chat with, and while they do have a waitlist, as they gradually roll out their release, James said to DM him on X and mention ThursdAI, and he'll put you through, so you'll be able to get an OpenAI compatible API key and be able to test this insane speed.

P.S - we also did an independent verification of these speeds, using Weave, and found Cerebras to be quite incredible for agentic purposes, you can read our report here and the weave dashboard here.

Anthropic - unlocking just-in-time applications with artifacts for all

Well, if you aren't paying for Claude, maybe this will convince you. This week, Anthropic announced that artifacts are available to all users, not only their paid customers. Artifacts are a feature in Claude that is basically a side pane (and from this week, a drawer in their mobile apps) that allows you to see what Claude is building, by rendering the web application almost on the fly. They have also trained Claude in working with that interface, so it knows about the different files etc.

Effectively, this turns Claude into a web developer that will build mini web applications (without backend) for you, on the fly, for any task you can think of. Drop a design, and it'll build a mock of it, drop some data in a CSV and it'll build an interactive one-time dashboard visualizing that data, or just ask it to build an app helping you split the bill between friends by uploading a picture of a bill.
Artifacts are shareable and remixable, so you can build something and share it with friends. So here you go, an artifact I made by dropping my notes into Claude and asking for a magic 8-ball that will spit out a random fact from today's editing of ThursdAI. I also provided Claude with an 8-ball image, but it didn't work due to restrictions, so instead I just uploaded that image to Claude and asked it to recreate it.
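Before we close out this edition, a quick back-of-the-envelope sketch of what the Cerebras throughput numbers above actually mean in practice. The Cerebras figures come from their announcement; the "typical GPU serving" baseline below is my own rough assumption for comparison, not a measured benchmark:

```python
# How long does a 1000-token answer take at various decode speeds?
# Cerebras numbers are from their announcement; the GPU baseline is an
# assumed ballpark for illustration only.
speeds_tps = {
    "cerebras_llama31_8b": 1822,   # tokens/sec (Cerebras, LLama 3.1 8B)
    "cerebras_llama31_70b": 447,   # tokens/sec (Cerebras, LLama 3.1 70B)
    "typical_gpu_serving": 80,     # tokens/sec (assumed baseline)
}

answer_tokens = 1000
seconds = {name: answer_tokens / tps for name, tps in speeds_tps.items()}

for name, s in seconds.items():
    print(f"{name}: {s:.2f}s")

# The 8B model streams a 1000-token answer in roughly half a second,
# which is why pasting a whole draft and hitting send feels instant.
assert seconds["cerebras_llama31_8b"] < 1.0
```

That sub-second turnaround for a full summary is the qualitative difference between "waiting for a reply" and "instant", and it's why agentic loops (many model calls chained together) benefit so much from this kind of speed.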
Hey there, Alex here with an end of summer edition of our show, which did not disappoint. Today is the official anniversary of Stable Diffusion 1.4, can you believe it? It's the second week in a row that we have an exclusive LLM launch on the show (after Emozilla announced Hermes 3 on last week's show), and spoiler alert, we may have something cooking for next week as well!

This edition of ThursdAI is brought to you by W&B Weave, our LLM observability toolkit, letting you evaluate LLMs for your own use-case easily.

Also this week, we've covered both ends of AI progress, a doomerist CEO saying "Fck Gen AI" vs an 8yo coder, and I continued to geek out on putting myself into memes (I promised I'll stop... at some point), so buckle up, let's take a look at another crazy week:

TL;DR

* Open Source LLMs
  * AI21 releases Jamba 1.5 Large / Mini hybrid Mamba MoE (X, Blog, HF)
  * Microsoft Phi 3.5 - 3 new models including MoE (X, HF)
  * BFCL 2 - Berkeley Function Calling Leaderboard V2 (X, Blog, Leaderboard)
  * NVIDIA - Mistral Nemo Minitron 8B - Distilled / Pruned from 12B (HF)
  * Cohere paper proves - code improves intelligence (X, Paper)
  * MOHAWK - transformer → Mamba distillation method (X, Paper, Blog)
* AI Art & Diffusion & 3D
  * Ideogram launches v2 - new img diffusion king 👑 + API (X, Blog, Try it)
  * Midjourney is now on web + free tier (try it finally)
  * Flux keeps getting better, cheaper, faster + adoption from OSS (X, X, X)
  * Procreate hates generative AI (X)
* Big CO LLMs + APIs
  * Grok 2 full is finally available on X - performs well on real time queries (X)
  * OpenAI adds GPT-4o Finetuning (blog)
  * Google API updates - 1000-page PDFs + LOTS of free tokens (X)
* This week's Buzz
  * Weights & Biases Judgement Day SF Hackathon on September 21-22 (Sign up to hack)
* Video
  * Hotshot - new video model - trained by 4 guys (try it, technical deep dive)
  * Luma Dream Machine 1.5 (X, Try it)
* Tools & Others
  * LM Studio 0.3.0 update - local RAG, structured outputs with any model & more (X)
  * Vercel - v0 now
has chat (X)
  * Ark - a completely offline device - offline LLM + world maps (X)
  * Ricky's daughter coding with Cursor video is a must watch (video)

The Best of the Best: Open Source Wins with Jamba, Phi 3.5, and Surprise Function Calling Heroes

We kick things off this week by focusing on what we love the most on ThursdAI: open-source models! We had a ton of incredible releases this week, starting off with something we were super lucky to have live, the official announcement of AI21's latest LLM: Jamba.

AI21 Officially Announces Jamba 1.5 Large/Mini – The Powerhouse Architecture Combines Transformer and Mamba

While we covered the Jamba release on the show back in April, Jamba 1.5 is an updated powerhouse. It's 2 models, Large and Mini, both MoE, and both still the hybrid architecture of Transformers + Mamba that tries to get the best of both worlds. Itay Dalmedigos, technical lead at AI21, joined us on the ThursdAI stage for an exclusive first look, giving us the full rundown on this developer-ready model with an awesome 256K context window, but it's not just the size – it's about using that size effectively. AI21 measured the effective context use of their model on the new RULER benchmark released by NVIDIA, an iteration of needle-in-the-haystack, and showed that their models have full utilization of context, as opposed to many other models.

"As you mentioned, we're able to pack many, many tokens on a single GPU. Uh, this is mostly due to the fact that we are able to quantize most of our parameters", Itay explained, diving into their secret sauce, ExpertsInt8, a novel quantization technique specifically designed for MoE models.

Oh, and did we mention Jamba is multilingual (eight languages and counting), natively supports structured JSON, function calling, document digestion… basically everything developers dream of.
They even chucked in citation generation, as its long context can contain full documents, so your RAG app may not even need to chunk anything, and the citations can cite full documents!

Berkeley Function Calling Leaderboard V2: Updated + Live (link)

Ever wondered how to measure the real-world magic of those models boasting "I can call functions! I can do tool use! Look how cool I am!" 😎? Enter the Berkeley Function Calling Leaderboard (BFCL) 2, a battleground where models clash to prove their function calling prowess.

Version 2 just dropped, and this ain't your average benchmark, folks. It's armed with a "Live Dataset" - a dynamic, user-contributed treasure trove of real-world queries, rare function documentations, and specialized use-cases spanning multiple languages. Translation: NO more biased, contaminated datasets. BFCL 2 is as close to the real world as it gets.

So, who's sitting on the Function Calling throne this week? Our old friend Claude 3.5 Sonnet, with an impressive score of 73.61. But breathing down its neck is GPT 4-0613 (the OG Function Calling master) with 73.5. That's right, the one released a year ago, the first one with function calling, in fact the first LLM with function calling as a concept IIRC!

Now, prepare for the REAL plot twist. The top-performing open-source model isn't some big name, resource-heavy behemoth. It's a tiny little underdog called Functionary Medium 3.1, a finetuned version of Llama 3.1 that blew everyone away. It even outscored both versions of Claude 3 Opus AND GPT-4, leaving folks scrambling to figure out WHO created this masterpiece.

"I've never heard of this model. It's MIT licensed from an organization called MeetKai. Have you guys heard about Functionary Medium?" I asked, echoing the collective bafflement in the space.
Yep, turns out there's gold hidden in the vast landscape of open source models, just waiting to be unearthed ⛏️.

Microsoft updates Phi 3.5 - 3 new models including an MoE + MIT license

3 new Phis dropped this week, including an MoE one and a new revamped vision one. They look very decent on benchmarks yet again, with the mini version (3.8B) seemingly beating LLama 3.1 8B on a few benchmarks. However, as before, the excitement is met with caution, because Phi models tend to look great on benchmarks, but when folks actually talk with them, they're usually not as impressed. Terry from BigCodeBench also saw a significant decrease in coding ability for Phi 3.5 vs 3.1.

Of course, we're not complaining, the models released with 128K context and an MIT license. The thing I'm most excited about is the vision model update: it now supports "multi-frame image understanding and reasoning", which is a big deal! This means understanding videos more natively across scenes.

This week's Buzz

Hey, if you're reading this while sitting in the Bay Area, and you don't have plans for exactly a month from now, why don't you come and hack with me? (Register Free)

Announcing the first W&B hackathon, Judgement Day, which is going to be focused on LLM as a judge! Come hack on innovative LLM-as-a-judge ideas, UIs, evals and more, meet other like-minded hackers and AI engineers, and win great prizes!

🎨 AI Art: Ideogram Crowns Itself King, Midjourney Joins the Internet & FLUX everywhere

While there was little news from big LLM labs this week, there is a LOT of AI art news, which is fitting to celebrate the 2-year Stable Diffusion 1.4 anniversary!

👑 Ideogram v2: Text Wizardry and API Access (But No Loras… Yet?)

With significantly improved realism, and likely the best text generation across all models out there, Ideogram v2 just took over the AI image generation game! Just look at that text sharpness!
They now offer a selection of styles (Realistic, Design, 3D, Anime) and any aspect ratio you'd like, and also, brands can now provide color palettes to control the outputs!

Adding to this is a new API offering (0.8c per image for the main model, 0.5c for the new turbo model of v2!) and a new iOS app. They also added the option (for premium users only) to search through a billion generations and their prompts, which is a great offering as well, as sometimes you don't even know what to prompt.

They claim a significant improvement over Flux[pro] and Dalle-3 in text, alignment and overall; interesting that MJ was not compared!

Meanwhile, Midjourney finally launched a website and a free tier, so no longer do you have to learn to use Discord to even try Midjourney.

Meanwhile Flux enjoys the fruits of Open Source

While Ideogram and MJ fight it out on the closed source side, Black Forest Labs enjoys the fruits of releasing their weights in the open. Fal just released an update where LORAs run 2.5x faster and 2.5x cheaper, CivitAI has LORAs for pretty much every character and celebrity ported to FLUX already, different techniques like ControlNet Unions, IPAdapters and more are being trained as we speak, and tutorials upon tutorials are released on how to customize these models, for free (shoutout to my friend Matt Wolfe for this one).

You can now train your own face on fal.ai, replicate.com and astria.ai, and thanks to Astria, I was able to find some old generations of my LORAs from the 1.5 days (not quite 1.4, but still, enough to show the difference between then and now) and whoa.

🤔 Is This AI Tool Necessary, Bro?

Let's end with a topic that stirred up a hornet's nest of opinions this week: Procreate, a beloved iPad design app, publicly declared their "fing hate" for Generative AI. Yeah, you read that right. Hate.
The CEO, in a public statement, went FULL scorched earth, proclaiming that AI-powered features would never sully the pristine code of their precious app.

"Instead of trying to bridge the gap, he's creating more walls", Wolfram commented, echoing the general "dude… what?" vibe in the space. "It feels marketeerial", I added, pointing out the obvious PR play (while simultaneously acknowledging the very REAL, very LOUD segment of the Procreate community that cheered this decision).

Here's the thing: you can hate the tech. You can lament the potential demise of the human creative spark. You can rail against the looming AI overlords. But one thing's undeniable: thi
Look, these crazy weeks don't seem to stop, and though this week started out a bit slower (while folks were waiting to see how the speculation about certain red-berry-flavored conspiracies would shake out), the big labs are shipping! We've got space uncle Elon dropping an "almost-GPT4" level Grok-2, that's uncensored, has access to real time data on X and can draw all kinds of images with Flux, OpenAI announced a new ChatGPT-4o version (not the one from last week that supported structured outputs, a different one!) and Anthropic dropped something that makes AI Engineers salivate!

Oh, and for the second week in a row, ThursdAI live spaces were listened to by over 4K people, which is very humbling, and awesome because for example today, Nous Research announced Hermes 3 live on ThursdAI before the public heard about it (and I had a long chat w/ Emozilla about it, very well worth listening to).

TL;DR of all topics covered:

* Big CO LLMs + APIs
  * xAI releases Grok-2 - frontier level Grok, uncensored + image gen with Flux (𝕏, Blog, Try It)
  * OpenAI releases another ChatGPT-4o (and tops LMsys again) (X, Blog)
  * Google showcases Gemini Live, Pixel Buds w/ Gemini, Google Assistant upgrades (Blog)
  * Anthropic adds Prompt Caching in Beta - cutting costs by up to 90% (X, Blog)
* AI Art & Diffusion & 3D
  * Flux now has support for LORAs, ControlNet, img2img (Fal, Replicate)
  * Google Imagen-3 is out of secret preview and it looks very good (𝕏, Paper, Try It)
* This week's Buzz
  * Using Weights & Biases Weave to evaluate Claude Prompt Caching (X, Github, Weave Dash)
* Open Source LLMs
  * NousResearch drops Hermes 3 - 405B, 70B, 8B LLama 3.1 finetunes (X, Blog, Paper)
  * NVIDIA Llama-3.1-Minitron 4B (Blog, HF)
  * AnswerAI - colbert-small-v1 (Blog, HF)
* Vision & Video
  * Runway Gen-3 Turbo is now available (Try It)

Big Companies & LLM APIs

Grok 2: Real Time Information, Uncensored as Hell, and… Flux?!

The team at xAI definitely knows how to make a statement, dropping a knowledge bomb on us with the release
of Grok 2. This isn't your uncle's dad joke model anymore - Grok 2 is a legitimate frontier model, folks. As Matt Shumer excitedly put it: "If this model is this good with less than a year of work, the trajectory they're on, it seems like they will be far above this...very very soon" 🚀

Not only does Grok 2 have impressive scores on MMLU (beating the previous GPT-4o on their benchmarks… from MAY 2024), it even outperforms Llama 3 405B, proving that xAI isn't messing around.

But here's where things get really interesting. Not only does this model access real time data through Twitter, which is a MOAT so wide you could probably park a rocket in it, it's also VERY uncensored. Think generating political content that'd make your grandma clutch her pearls, or imagining Disney characters breaking bad in a way that's both hilarious and kinda disturbing, all thanks to Grok 2's integration with Black Forest Labs' Flux image generation model.

With an affordable price point ($8/month for X Premium, including access to Grok 2 and their killer MidJourney competitor?!), it'll be interesting to see how Grok's "truth seeking" (as xAI calls it) model plays out. Buckle up, folks, this is going to be wild, especially since all the normies now have the power to create political memes that look VERY realistic, within seconds. Oh yeah… and there's the upcoming Enterprise API as well… and Grok 2 made its debut in the wild on the LMSys Arena, lurking incognito as "sus-column-r", and it's now placed on TOP of Sonnet 3.5 and comes in as number 5 overall!

OpenAI's latest ChatGPT is back at #1, but it's all very confusing 😵‍💫

As the news about Grok-2 was settling in, OpenAI decided to, well… drop yet another GPT-4o update on us. While Google was hosting their event, no less. Seriously, OpenAI?
I guess they like to one-up Google's new releases (they also kicked Gemini from the #1 position after only 1 week there). So what was anonymous-chatbot on LMsys for the past week was also released in the ChatGPT interface, and it is now the best LLM in the world according to LMSYS and other folks: it's #1 at math, #1 at complex prompts, #1 at coding and #1 overall. It is also available for us developers via API, but... they don't recommend using it? 🤔

The most interesting thing about this release is, they don't really know how to tell us why it's better, they just know that it is, qualitatively, and that it's not a new frontier-class model (ie, not 🍓 or GPT5). Their release notes on this are something else 👇

Meanwhile it's been 3 months, and the promised Advanced Voice Mode is only in the hands of a few lucky testers so far.

Anthropic Releases Prompt Caching to Slash API Prices By up to 90%

Anthropic joined DeepSeek's game of "Let's Give Devs Affordable Intelligence" this week, rolling out prompt caching with up to 90% cost reduction on cached tokens (yes, NINETY… 🤯). For those of you new to all this technical sorcery: prompt caching allows the inference provider to save users money by reusing repeated chunks of a long prompt from cache, reducing price and time to first token, and it is especially beneficial for longer-context (>100K) use-cases like conversations with books, agents with a lot of memory, 1000 examples in a prompt, etc.

We covered caching before with Gemini (at Google I/O) and last week with DeepSeek, but IMO this is a better implementation from a frontier lab: easy to get started with, it manages the timeout for you (unlike Google), and it's a no brainer implementation.
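To make this concrete, here's a minimal sketch of what a cached request body looked like at launch, plus the dollar math behind the 90% claim. The model name, beta header, field names and prices follow the launch announcement as I understood it; treat them as assumptions and check the current Anthropic docs before relying on them:

```python
# Sketch of an Anthropic Messages API request using prompt caching, sent with
# the "anthropic-beta: prompt-caching-2024-07-31" header at launch.
# Field names and prices here are assumptions from the launch docs.

long_transcripts = "..."  # e.g. ~110K tokens of podcast transcripts

request_body = {
    "model": "claude-3-5-sonnet-20240620",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": long_transcripts,
            # This marker tells the API to cache the prompt up to this block,
            # so repeated calls reuse it instead of re-processing it:
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "What did Alex say about Weave?"}],
}

# The 90% claim in dollars (Sonnet input ~$3.00/Mtok, cache reads ~$0.30/Mtok):
tokens = 110_000
uncached_cost = tokens / 1_000_000 * 3.00   # ~$0.33 per call
cached_cost = tokens / 1_000_000 * 0.30     # ~$0.033 per call
assert round(cached_cost / uncached_cost, 2) == 0.10  # the 90% reduction
```

Every follow-up question against the same transcripts then pays the cached rate for that prefix, which is why asking many questions over a big document suddenly becomes nearly free.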
And, you'll definitely want to see the code to implement it all yourself (plus Weave is free! 🤩):

"In this week's buzz category… I used Weave, our LLM observability tooling, to super quickly evaluate how much cheaper Claude Caching from Anthropic really is, I did a video of it and I posted the code … If you're into this and want to see how to actually do this … how to evaluate, the code is there for you" - Alex

With the ridiculous 90% price drop for those cached calls (Haiku basically becomes FREE, and cached Claude costs about as much as Haiku, ~$0.30 per 1M tokens), for context: I took 5 transcripts of 2-hour podcast conversations, which amounted to ~110,000 tokens overall, and I was able to ask questions across all this text, and it cost me less than $1 (see in the above video).

Code Here + Weave evaluation Dashboard here

AI Art, Diffusion, and Personalized AI On the Fly

Speaking of mind blowing, Flux took over this week, thanks in no small part to Elon strategically leveraging their tech in Grok (and everyone reminding everyone else that it's not Grok creating images, it's Flux!)

Now, remember, the REAL magic happens when code meets open source. "Flux now has support for LORAs, ControlNet, img2img…", meaning developers have turned those foundational tools into artistic wizardry. With as little as 5 bucks and a few pictures, "You can train the best image model on your own face." 🤯 (Seriously folks, head up to Fal.ai, give it a whirl… it's awesome)

Now if you combine the LORA tech with ControlNet tech, you can get VERY creepy very fast (I'm using my own face here but you get the idea), here's "me" as the distracted boyfriend meme, and the girlfriend, and the distraction 😂 (I'm sorry you had to see this, AI has gone too far! Shut it all down!)

If seeing those creepy faces on screen isn't for you (I totally get that), there's also Google IMAGEN 3, freshly escaped from secret preview and just waiting for you to unleash those artistic prompts on it!
Google, despite being… Google, somehow figured out that a little competition does a lab good and rolled out a model that's seriously impressive.

Runway Video Gets a "Turbocharged" Upgrade 🚀🚀🚀

Ever tried those jaw-dropping text-to-video generators but groaned as you watched those seconds of video render painfully slowly? 😭 Well Runway, creators of Gen 3, answered our prayers with a distilled, turbocharged version that churns out those visuals in a blink 🤯🤯🤯.

What's truly cool is they unlocked it for FREE tier users (sign up and unleash those cinematic prompts right now!), letting everyday folks dip their toes in those previously-unfathomable waters. Even the skeptics at OpenBMB (Junyang knows what I'm talking about…) had to acknowledge that their efforts with MiniCPM-V are impressive, especially the smooth way it captures video sequences better than models even twice its size 🤯.

Open Source: Hermes 3 and The Next Generation of Open AI 🚀

NousResearch Dropped Hermes 3: Your New Favorite AI (Yes Really)

In the ultimate "We Dropped This On ThursdAI Before Even HuggingFace", the legendary team at NousResearch dropped the hottest news since Qwen decided to play math God: Hermes 3 is officially here! 🤯

"You're about to get to use the FIRST big finetune of LLama 3.1 405B… We don't think there have been finetunes," announced Emozilla, who's both co-founder and resident master wizard of all things neural net, "And it's available to try for free thanks to Lambda, you can try it out right here" (you're all racing to their site as I type this, I KNOW it!).
Not ONLY does this beauty run ridiculously smooth on Lambda, but here's the real TL;DR:

* Hermes 3 isn't just 405B; there are 70B and 8B versions dropping simultaneously on Hugging Face, ready to crush benchmarks and melt your VRAM (in a GOOD way… okay maybe not so great for your power bill 😅).
* On benchmarks, they beat LLama 3.1 Instruct on a few evals and lose on some, which is quite decent, given that the Meta team did an amazing job with their instruct finetuning (and probably spent millions of $ on it too).
* Hermes 3 is all about user alignment, which our open source champion Wolfram Ravenwolf summarized beautifully: "When you have a model, and you run it on your system, IT MUST BE LOYAL TO YOU." 😈

Hermes 3 does just that with incredibly precise control via its godlike system prompt: "In Hermes 3 the system prompt is KING," confirmed Emoz. It's so powerful that the 405B version was practically suffering existential angst in their first conversation… I read that part out loud during the space.
Hold on tight, folks, because THIS week on ThursdAI felt like riding a roller coaster through the wild world of open-source AI - extreme highs, mind-bending twists, and a sprinkle of "wtf is happening?" conspiracy theories for good measure. 😂

The theme of this week: Open Source keeps beating GPT-4, while we're inching towards intelligence too cheap to meter on the API front. We even had a live demo so epic, folks at the Large Hadron Collider are taking notice! Plus, strawberry shenanigans abound (did Sam REALLY tease GPT-5?), and your favorite AI evangelist nearly got canceled on X! Buckle up; this is gonna be another long one! 🚀

Qwen2-Math Drops a KNOWLEDGE BOMB: Open Source Wins AGAIN!

When I say "open source AI is unstoppable", I MEAN IT. This week, the brilliant minds from Alibaba's Qwen team decided to show everyone how it's DONE. Say hello to Qwen2-Math-72B-Instruct - a specialized language model SO GOOD at math, it's achieving a ridiculous 84 points on the MATH benchmark. 🤯

For context, folks... that's beating GPT-4, Claude Sonnet 3.5, and Gemini 1.5 Pro. We're not talking incremental improvements here - this is a full-blown DOMINANCE of the field, and you can download and use it right now. 🔥

Get Qwen-2 Math from HuggingFace here

What made this announcement EXTRA special was that Junyang Lin, the Chief Evangelist Officer at the Alibaba Qwen team, joined ThursdAI moments after they released it, giving us a behind-the-scenes peek at the effort involved. Talk about being in the RIGHT place at the RIGHT time! 😂

They painstakingly crafted a massive, math-specific training dataset, incorporating techniques like Chain-of-Thought reasoning (where the model thinks step-by-step) to unlock this insane level of mathematical intelligence.

"We have constructed a lot of data with the form of ... Chain of Thought ... And we find that it's actually very effective.
And for the post-training, we have done a lot with rejection sampling to create a lot of data sets, so the model can learn how to generate the correct answers" - Junyang Lin

Now I gotta give mad props to Qwen for going beyond just raw performance - they're open-sourcing this beast under an Apache 2.0 license, meaning you're FREE to use it, fine-tune it, adapt it to your wildest mathematical needs! 🎉

But hold on... the awesomeness doesn't stop there! Remember those smaller, resource-friendly LLMs everyone's obsessed with these days? Well, Qwen released 7B and even 1.5B versions of Qwen-2 Math, achieving jaw-dropping scores for their size (70 for the 1.5B?? That's unheard of!). 🤯 Nisten nearly lost his mind when he heard that - and trust me, he's seen things. 😂

"This is insane! This is... what, Sonnet 3.5 gets what, 71? 72? This gets 70? And it's a 1.5B? Like I could run that on someone's watch. Real." - Nisten

With this level of efficiency, we're talking about AI-powered calculators, tutoring apps, research tools that run smoothly on everyday devices. The potential applications are endless!

MiniCPM-V 2.6: A Pocket-Sized GPT-4 Vision... Seriously! 🤯

If Qwen's Math marvel wasn't enough open-source goodness for ya, OpenBMB had to get in on the fun too! This time, they're bringing the 🔥 to vision with MiniCPM-V 2.6 - a ridiculous 8 billion parameter VLM (visual language model) that packs a serious punch, even outperforming GPT-4 Vision on OCR benchmarks!

OpenBMB drops a bomb on X here

I'll say this straight up: talking about vision models in a TEXT-based post is hard. You gotta SEE it to believe it. But folks... TRUST ME on this one.
This model is mind-blowing, capable of analyzing single images, multi-image sequences, and EVEN VIDEOS with an accuracy that rivaled my wildest hopes for open-source. 🤯

Check out their playground and prepare to be stunned.

It even captured every single nuance in this viral toddler speed-running video I threw at it, with an accuracy I haven't seen in models THIS small:

"The video captures a young child's journey through an outdoor park setting. Initially, the child ... is seen sitting on a curved stone pathway besides a fountain, dressed in ... a green t-shirt and dark pants. As the video progresses, the child stands up and begins to walk ..."

Junyang said that they actually collabbed with the OpenBMB team and knows firsthand how much effort went into training this model:

"We actually have some collaborations with OpenBMB... it's very impressive that they are using, yeah, multi-images and video. And very impressive results. You can check the demo... the performance... We care a lot about MMMU [the benchmark], but... it is actually relying much on large language models." - Junyang Lin

Nisten and I have been talking for months about the relationship between these visual "brains" and the larger language model base powering their "thinking." While it seems smaller models are catching up fast, combining a top-notch visual processor like MiniCPM-V with a monster LLM like Qwen 72B or Llama 405B could unlock truly unreal capabilities. This is why I'm excited - open source lets us mix and match like this! We can Frankenstein the best parts together and see what emerges... and it's usually something mind-blowing. 🤯

From the Large Hadron Collider to YOUR Phone: This Model Runs ANYWHERE 🚀

While Qwen2-Math is breaking records on one hand, Nisten's latest creation, Biggie-SmoLlm, is showcasing the opposite side of the spectrum.
Trying to get the smallest/fastest coherent LLM possible, Nisten blew up on HuggingFace. Biggie-SmoLlm (Hugging Face) is TINY, efficient, and with some incredible optimization work from the folks right here on the show, it's reaching an insane 330 tokens/second on regular M3 chips. 🤯 That's WAY faster than real-time conversation, folks! And thanks to Eric Hartford's (from Cognitive Computations) awesome new optimizer (GrokAdamW), it's surprisingly coherent for such a lil' fella.

The cherry on top? Someone messaged Nisten saying they're using Biggie-SmoLlm at the Large. Hadron. Collider. 😳 I'll let that sink in for a second.

It was incredible having ALL the key players behind Biggie-SmoLlm right there on stage: LDJ (whose Capybara dataset made it teaching-friendly), Junyang (whose Qwen work served as the base), and Eric (the optimizer mastermind himself). THIS, my friends, is what the ThursdAI community is ALL about! 🚀

Speaking of which, this week we got a new friend of the pod, Mark Saroufim, a long time PyTorch core maintainer, to join the community.

This Week's Buzz (and Yes, It Involves Making AI Even Smarter) 🤓

NeurIPS Hacker Cup 2024 - Can You Solve Problems Humans Struggle With? 🤔

I've gotta hand it to my PyTorch friend, Mark Saroufim. He knows how to make AI interesting! He and his incredible crew (Weiwei from MSFT, some WandB brainiacs, and more) are bringing you NeurIPS Hacker Cup 2024 - a competition to push those coding agents to their ABSOLUTE limits. 🚀

This isn't your typical "LeetCode easy" challenge, folks... These are problems SO hard, years of competitive programming experience are required to even attempt them! Mark himself said, "At this point, like, if a model does make a significant dent in this competition, uh, I think people would need to acknowledge that, like, LLMs can do a form of planning."

And don't worry, total beginners: Mark and Weights & Biases are hosting a series of FREE sessions to level you up.
Get those brain cells prepped and ready for the challenge, and then join the NeurIPS Hacker Cup Discord

P.S. We're ALSO starting a killer AI Salon series in our SF office August 15th! You'll get a chance to chat with researchers like Shreya Shankar - she's a leading voice on evaluation. More details and free tickets right here! AI Salons Link

Big Co & APIs - Towards intelligence too cheap to meter

Open-source was crushing it this week... but that didn't stop Big AI from throwing a few curveballs. OpenAI is doubling down on structured data (AND cheaper models!), Google slashed Gemini prices again (as we trend towards intelligence too cheap to meter), and a certain strawberry mystery took over Twitter.

DeepSeek context caching lowers price by 90% automatically

DeepSeek, those masters of ridiculously-good coding AI, casually dropped a bombshell - context caching for their API! 🤯 If you're like "wait, what does THAT mean?", listen up because this is game-changing for production-grade AI:

* Problem: LLMs get fed the ENTIRE conversation history EVERY. SINGLE. TIME. This wastes compute (and $$$) when info is repeated.
* Solution: DeepSeek now remembers what you've said, automatically pulling from a cache when the conversation goes down familiar paths.
* The Win: Up to 90% cheaper API calls. Yes, NINETY. 😳 It costs 1.4 CENTS per million tokens for cached content. Let THAT sink in. 🤯

As Nisten (always bringing the technical breakdowns) explained:

"Everyone should be using LLMs this way!...The simplest way is to have a long conversation ... then you save it on disk... you don't have to wait again ... [it's] kind of free. DeepSeek... did this in a more dynamic way". - Nisten

Even Matt Shumer, who usually advocates for clever prompting over massive context, got legitimately hyped about the possibilities:

"For me, and how we use LLMs... instead of gathering a million examples... curate a hundred gold examples...
you have something better than if you fine-tuned it, and cheaper, and faster..." - Matt Shumer

Think about this... instead of painstakingly fine-tuning, we can "guide" models with expertly crafted examples, letting them learn "on the fly" with minimal cost. Context as the NEW fine-tuning! 🤯

P.S. - Google actually also has caching on its Gemini API, but you have to opt in, while this happens automatically with the DeepSeek API!

Google Goes "Price War Nuclear": Gemini Flash is Officially TOO CHEAP

Speaking of sneaky advancements from Google... they also dropped an update SO casually impactful, it almost got lost in the shuffle. Gemini Flas
Starting Monday, Apple released iOS 18.1 with Apple Intelligence, then Meta dropped SAM-2 (Segment Anything Model), then Google open sourced Gemma 2 2B, and now (just literally 2 hours ago, during the live show) released Gemini 1.5 0801 experimental that takes #1 on the LMsys arena across multiple categories. To top it all off, we also got a new SOTA image diffusion model called FLUX.1 from ex-Stability folks and their new Black Forest Labs.

This week on the show, we had Joseph & Piotr Skalski from Roboflow talk in depth about Segment Anything, and as the absolute experts on this topic (Skalski is our returning vision expert), it was an incredible deep dive into the importance of dedicated vision models (not VLMs).

We also had Lukas Atkins & Fernando Neto from Arcee AI talk to us about their new DistillKit and explain model distillation in detail & finally we had Cristiano Giardina, who is one of the lucky few that got access to OpenAI advanced voice mode + his new friend GPT-4o, come on the show as well!

Honestly, how can one keep up with all this? By reading ThursdAI of course, that's how, but ⚠️ buckle up, this is going to be a BIG one (I think over 4.5K words, will mark this as the longest newsletter I penned, I'm sorry, maybe read this one on 2x?
😂)

[ Chapters ]
00:00 Introduction to the Hosts and Their Work
01:22 Special Guests Introduction: Piotr Skalski and Joseph Nelson
04:12 Segment Anything 2: Overview and Capabilities
15:33 Deep Dive: Applications and Technical Details of SAM2
19:47 Combining SAM2 with Other Models
36:16 Open Source AI: Importance and Future Directions
39:59 Introduction to Distillation and DistillKit
41:19 Introduction to DistillKit and Synthetic Data
41:41 Distillation Techniques and Benefits
44:10 Introducing Fernando and Distillation Basics
44:49 Deep Dive into Distillation Process
50:37 Open Source Contributions and Community Involvement
52:04 ThursdAI Show Introduction and This Week's Buzz
53:12 Weights & Biases New Course and San Francisco Meetup
55:17 OpenAI's Advanced Voice Mode and Cristiano's Experience
01:08:04 SearchGPT Release and Comparison with Perplexity
01:11:37 Apple Intelligence Release and On-Device AI Capabilities
01:22:30 Apple Intelligence and Local AI
01:22:44 Breaking News: Black Forest Labs Emerges
01:24:00 Exploring the New Flux Models
01:25:54 Open Source Diffusion Models
01:30:50 LLM Course and Free Resources
01:32:26 FastHTML and Python Development
01:33:26 Friend.com: Always-On Listening Device
01:41:16 Google Gemini 1.5 Pro Takes the Lead
01:48:45 GitHub Models: A New Era
01:50:01 Concluding Thoughts and Farewell

Show Notes & Links

* Open Source LLMs
* Meta gives SAM-2 - segment anything with one shot + video capability! (X, Blog, DEMO)
* Google open sources Gemma 2 2.6B (Blog, HF)
* MTEB Arena launching on HF - Embeddings head to head (HF)
* Arcee AI announces DistillKit (X, Blog, Github)
* AI Art & Diffusion & 3D
* Black Forest Labs - FLUX new SOTA diffusion models (X, Blog, Try It)
* Midjourney 6.1 update - greater realism + potential Grok integration (X)
* Big CO LLMs + APIs
* Google updates Gemini 1.5 Pro with 0801 release and is #1 on LMsys arena (X)
* OpenAI started alpha GPT-4o voice mode (examples)
* OpenAI releases SearchGPT (Blog, Comparison w/ PPXL)
* Apple releases beta of iOS 18.1 with Apple Intelligence (X, hands on, Intents)
* Apple released a technical paper of Apple Intelligence
* This week's Buzz
* AI Salons in SF + New Weave course for WandB featuring yours truly!
* Vision & Video
* Runway ML adds Gen-3 image to video and makes it 7x faster (X)
* Tools & Hardware
* Avi announces friend.com
* Jeremy Howard releases FastHTML (Site, Video)
* Applied LLM course from Hamel dropped all videos

Open Source

It feels like everyone and their grandma is open sourcing incredible AI this week! Seriously, get ready for segment-anything-you-want + real-time-video capability PLUS small AND powerful language models.

Meta Gives Us SAM-2: Segment ANYTHING Model in Images & Videos... With One Click!

Hold on to your hats, folks! Remember Segment Anything, Meta's already-awesome image segmentation model? They've just ONE-UPPED themselves. Say hello to SAM-2 - it's real-time, promptable (you can TELL it what to segment), and handles VIDEOS like a champ. As I said on the show: "I was completely blown away by segment anything 2".

But wait, what IS segmentation? Basically, pixel-perfect detection - outlining objects with incredible accuracy. My guests, the awesome Piotr Skalski and Joseph Nelson (computer vision pros from Roboflow), broke it down historically, from SAM 1 to SAM 2, and highlighted just how mind-blowing this upgrade is.

"So now, Segment Anything 2 comes out.
Of course, it has all the previous capabilities of Segment Anything ... But the segment anything tool is awesome because it also can segment objects on the video". - Piotr Skalski

Think about Terminator vision from the "give me your clothes" bar scene: you see a scene, instantly "understand" every object separately, AND track it as it moves. SAM-2 gives us that, allowing you to click on a single frame, and BAM - perfect outlines that flow through the entire video! I played with their playground, and you NEED to try it - you can blur backgrounds, highlight specific objects... the possibilities are insane. Playground Link

In this video, Piotr annotated only the first few frames of the top video, and SAM understood the bottom two shots from 2 different angles!

Okay, cool tech, BUT why is it actually USEFUL? Well, Joseph gave us incredible examples - from easier sports analysis and visual effects (goodbye manual rotoscoping) to advances in microscopic research and even galactic exploration! Basically, any task requiring precise object identification gets boosted to a whole new level.

"SAM does an incredible job at creating pixel perfect outlines of everything inside visual scenes. And with SAM2, it does it across videos super well, too ... That capability is still being developed for a lot of AI Models and capabilities. So having very rich ability to understand what a thing is, where that thing is, how big that thing is, allows models to understand spaces and reason about them" - Joseph Nelson

AND if you combine this power with other models (like Piotr is already doing!), you get zero-shot segmentation - literally type what you want to find, and the model will pinpoint it in your image/video. It's early days, but get ready for robotics applications, real-time video analysis, and who knows what else these clever hackers are dreaming up! 🤯

Check out Piotr's Zero Shot Florence + Sam2 Implementation

Best of all? Apache 2 license, baby!
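To make the model-chaining concrete: a grounding model like Florence turns a text query into boxes, SAM-2 turns a box prompt into a pixel mask, and downstream code often converts the mask back into a box for tracking or cropping. Here's a minimal, dependency-free sketch of that last step (the helper and the plain-list mask format are illustrative, not part of the SAM-2 API, which works with numpy arrays):

```python
def mask_to_bbox(mask):
    """Convert a binary mask (rows of 0/1) to an (x_min, y_min, x_max, y_max) box.

    Returns None if the mask is empty. Plain lists keep the sketch
    dependency-free; real SAM-2 masks are numpy arrays.
    """
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))

# A 4x4 mask with a 2x2 blob in the middle -> box (1, 1, 2, 2)
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(mask_to_bbox(mask))  # → (1, 1, 2, 2)
```

In a real pipeline you'd run this per frame on SAM-2's video masklets to feed a tracker or a cropper.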
As Joseph said, "Open source is foundational to making the accessibility, the use cases, and the advancement of the field overall", and this is a prime example. Huge kudos to Meta for empowering us with this tech.

The whole conversation w/ Piotr & Joseph is very much worth listening to on the pod 🎙️

Google Throws Down The Gauntlet: Open Sourcing Gemma 2 2.6B

It was Meta vs. Google on Monday because, NOT to be outdone, Google also went on an open-sourcing spree. This time, they gifted us Gemma 2 (a 2.6 billion parameter powerhouse), alongside a safety-focused suite called ShieldGemma AND a transparency tool called GemmaScope.

So what makes Gemma 2 special? First off, it's optimized for on-device use, meaning super-efficient local running. BUT there's a catch, folks... They claim it beats Mixtral AND Llama 2 70B on the LMsys Arena leaderboard, with an ELO score of 1126. Hold on, a 2 billion parameter model outperforming the big boys? 🤨 As LDJ (one of my regular co-hosts) said on the show:

"Yeah, I think my best theory here is... there's at least two or three variables at play ... In LMSys, people are much more likely to do single turn, and within LMSys, people will usually be biased more towards rating models with a more recent knowledge cutoff as higher".

Translation? It might be gaming the system a bit, but either way, Gemma 2 is an exciting release - super fast, small enough for on-device applications, and coming with safety tools right out the gate! I think Zenova (our Hugging Face wizard) is already running this on WebGPU! You NEED to try it out. Gemma 2 HF Link

And GemmaScope? That's some cool, cool stuff too. Think about peeking inside the "brain" of the model - you can actually SEE how Gemma 2 processes information. Remember Anthropic Mechinterp? It's like that, giving us unprecedented transparency into how these systems actually "think". You gotta see it on Neuronpedia. Neuronpedia link

It's Meta versus Google - round one, FIGHT!
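To put an Arena ELO of 1126 in context: Elo ratings translate directly into expected head-to-head win rates. This is the standard Elo formula, nothing LMsys-specific:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation: probability that model A beats model B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# Two models 100 Elo points apart: the stronger one is expected
# to win about 64% of head-to-head battles.
print(round(expected_score(1126, 1026), 2))  # → 0.64
```

So a tiny model sitting 100 points above a bigger one on the Arena implies voters preferred it roughly two times out of three, which is exactly why LDJ's "what are voters actually rating?" caveat matters.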
🥊 Distilling Knowledge: Arcee AI Drops DistillKit!

Just when I thought the week was done throwing surprises, Arcee AI casually dropped DistillKit - an open source tool to build distilled language models. Now, this is some NEXT level stuff, folks. We talked with Lukas Atkins and Fernando (the brilliant minds behind DistillKit), and I finally learned what the heck "distillation" really means.

"TLDR - we teach a smaller model to think like a bigger model"

In a nutshell: teach a smaller model how to think like a larger one. Think GPT-4o and GPT-4o Mini, where the smaller model supposedly got the "essence" of the bigger version. Or imagine a tiny Llama that inherited the smarts of 405B - ridiculous! 🤯 As Fernando eloquently put it:

"So in the finetuning that we have been doing, just in terms of generating text instructions and so on, we were observing only the token that was generated from the teacher model. And now with the distillation, we are observing the whole distribution of the tokens that could be sampled"

Now I admit, even after Fernando's expert breakdown, my brain still kind of melted. 🫠 BUT, here's why this matters: distilled models are super efficient, saving on cost and resources. Imagine powerful AI that runs seamlessly on your phone! 🤯 Arcee is making this possible for everyone.

Check Out DistillKit Here

Was it pure coincidence they released this on the same week as the Llama 3.1 LICENSE CHANGE (Zuckerberg is cl
Holy s**t, folks! I was off for two weeks, last week OpenAI released GPT-4o-mini and everyone was in my mentions saying, Alex, how are you missing this?? and I'm so glad I missed that last week and not this one, because while GPT-4o-mini is incredible (GPT-4o level distill with incredible speed and almost 99% cost reduction from 2 years ago?) it's not open source. So welcome back to ThursdAI, and buckle up because we're diving into what might just be the craziest week in open-source AI since... well, ever!

This week, we saw Meta drop LLAMA 3.1 405B like it's hot (including updated 70B and 8B), Mistral joining the party with their Large V2, and DeepSeek quietly updating their coder V2 to blow our minds. Oh, and did I mention Google DeepMind casually solving math Olympiad problems at a silver medal level 🥈? Yeah, it's been that kind of week.

TL;DR of all topics covered:

* Open Source
* Meta LLama 3.1 updated models (405B, 70B, 8B) - Happy LLama Day! (X, Announcement, Zuck, Try It, Try it Faster, Evals, Provider evals)
* Mistral Large V2 123B (X, HF, Blog, Try It)
* DeepSeek-Coder-V2-0724 update (API only)
* Big CO LLMs + APIs
* 🥈 Google Deepmind wins silver medal at Math Olympiad - AlphaGeometry 2 (X)
* OpenAI teases SearchGPT - their reimagined search experience (Blog)
* OpenAI opens GPT-4o-mini finetunes + 2 months free (X)
* This week's Buzz
* I compare 5 LLama API providers for speed and quantization using Weave (X)
* Voice & Audio
* Daily announces a new open standard for real time Voice and Video RTVI-AI (X, Try it, Github)

Meta LLAMA 3.1: The 405B Open Weights Frontier Model Beating GPT-4 👑

Let's start with the star of the show: Meta's LLAMA 3.1. This isn't just a 0.1 update; it's a whole new beast. We're talking about a 405 billion parameter model that's not just knocking on GPT-4's door – it's kicking it down.

Here's the kicker: you can actually download this internet scale intelligence (if you have 820GB free).
That's right, a state-of-the-art model beating GPT-4 on multiple benchmarks, and you can click a download button. As I said during the show, "This is not only refreshing, it's quite incredible."

Some highlights:

* 128K context window (finally!)
* MMLU score of 88.6
* Beats GPT-4 on several benchmarks like IFEval (88.6%), GSM8K (96.8%), and ARC Challenge (96.9%)
* Has Tool Use capabilities (also beating GPT-4) and is Multilingual (ALSO BEATING GPT-4)

But that's just scratching the surface. Let's dive deeper into what makes LLAMA 3.1 so special.

The Power of Open Weights

Mark Zuckerberg himself dropped an exclusive interview with our friend Rowan Cheng from Rundown AI. And let me tell you, Zuck's commitment to open-source AI is no joke. He talked about distillation, technical details, and even released a manifesto on why open AI (the concept, not the company) is "the way forward".

As I mentioned during the show, "The fact that this dude, like my age, I think he's younger than me... knows what they released to this level of technical detail, while running a multi billion dollar company is just incredible to me."

Evaluation Extravaganza

The evaluation results for LLAMA 3.1 are mind-blowing. We're not just talking about standard benchmarks here. The model is crushing it on multiple fronts:

* MMLU (Massive Multitask Language Understanding): 88.6%
* IFEval (Instruction Following): 88.6%
* GSM8K (Grade School Math): 96.8%
* ARC Challenge: 96.9%

But it doesn't stop there. The fine folks at Meta also for the first time added new categories like Tool Use (BFCL 88.5) and Multilinguality (Multilingual MGSM 91.6) (not to be confused with multimodality, which is not yet here, but soon).

Now, these are official evaluations from Meta themselves, which, as we know, often don't really represent the quality of the model, so let's take a look at other, more vibey results, shall we?
On the SEAL leaderboards from Scale (held back so they can't be trained on), LLama 405B is beating ALL other models on Instruction Following, placing 4th at Coding and 2nd at Math tasks. On MixEval (the eval that approximates LMsys with 96% accuracy), my colleagues Ayush and Morgan got a whopping 66%, placing 405B just after Claude Sonnet 3.5 and above GPT-4o.

And there are more evals that all tell the same story: we have a winner here, folks (see the rest of the evals in my thread roundup).

The License Game-Changer

Meta didn't just release a powerful model; they also updated their license to allow for synthetic data creation and distillation. This is huge for the open-source community.

LDJ highlighted its importance: "I think this is actually pretty important because even though, like you said, a lot of people still train on OpenAI outputs anyways, there's a lot of legal departments and a lot of small, medium, and large companies that they restrict the people building and fine-tuning AI models within that company from actually being able to build the best models that they can because of these restrictions."

This update could lead to a boom in custom models and applications across various industries as companies can start distilling, finetuning and creating synthetic datasets using these incredibly smart models.

405B: A Double-Edged Sword

While the 405B model is incredibly powerful, it's not exactly practical for most production use cases, as you need 2 nodes of 8 H100s to run it in full precision. Despite the fact that pricing wars have already started, and we see inference providers as low as $2.7/1M tokens, this hardly makes sense when GPT-4o mini is 15 cents.
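The "2 nodes of 8 H100s" figure is easy to sanity-check with back-of-the-envelope math (weights only, ignoring KV cache and activation memory):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to hold a model's weights, in GB."""
    return params_billion * bytes_per_param  # billions of params x bytes each = GB

# 405B params at bf16 (2 bytes each) is ~810 GB of weights alone, which
# overflows a single 8x80GB H100 node (640 GB) - hence two nodes.
print(weight_memory_gb(405, 2))    # → 810
print(weight_memory_gb(405, 1))    # fp8: ~405 GB, fits on one 8x80GB node
print(weight_memory_gb(405, 0.5))  # int4: ~202.5 GB
```

This is also why the quantized fp8 and int4 serving setups matter so much for this model: each halving of bytes-per-param roughly halves the hardware bill.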
However, this model shines in other areas:

* Synthetic Data Generation & Distillation: Its power and the new license make it perfect for creating high-quality training data and using it to train smaller models.
* LLM as a Judge: The model's reasoning capabilities make it an excellent candidate for evaluating other AI outputs.
* Research and Experimentation: For pushing the boundaries of what's possible in AI.

The Smaller Siblings: 70B and 8B

While the 405B model is grabbing headlines, don't sleep on its smaller siblings. The 70B and 8B models got significant upgrades too.

The 70B model saw impressive gains:

* MMLU: 80.9 to 86
* IFEval: 82 to 87
* GPQA: 39 to 46

The 8B model, in particular, could be a hidden gem. As Kyle Corbitt from OpenPipe discovered, a fine-tuned 8B model could potentially beat a prompted GPT-4 Mini in specific tasks.

No multi-modality

While Meta definitely addressed everything we had to ask for from the Llama 3 release - context window, incredible performance, multi-linguality, tool-use - we still haven't seen multi-modality with Llama. We still can't show it pictures or talk to it! However, apparently they have trained it to be multi-modal as well but haven't yet released those weights. They went into this in great detail in the paper and even showed a roadmap, stating that they will release it soon-ish (not in the EU though).

This Week's Buzz: Weave-ing Through LLama Providers

In the spirit of thorough evaluation, I couldn't resist putting LLAMA 3.1 through its paces across different providers.
Using Weights & Biases Weave (https://wandb.me/weave), our evaluation and tracing framework for LLMs, I ran a comparison between various LLAMA providers. Here's what I found:

* Different providers are running the model with varying optimizations (VLLM, FlashAttention3, etc.)
* Some are serving quantized versions, which can affect output style and quality
* Latency and throughput vary significantly between providers

The full results are available in a Weave comparison dashboard, which you can check out for a deep dive into the nuances of model deployment, and the code is up on Github if you want to verify this yourself or see how easy this is to do with Weave.

Mistral Crashes the Party with Large V2 123B model (X, HF, Blog, Try It)

Just when we thought Meta had stolen the show, Mistral AI decided to drop their own bombshell: Mistral Large V2. This 123 billion parameter dense model is no joke, folks. With an MMLU score of 84.0, a 128K context window and impressive performance across multiple benchmarks, it's giving LLAMA 3.1 a run for its money, especially in some coding tasks, while being optimized to run on a single node!

Especially interesting is the function calling, on which they claim SOTA, without telling us which metric they used (or comparing to Llama 3.1), but they are saying that they now support parallel and sequential function calling!

DeepSeek updates DeepSeek Coder V2 to 0724

While everyone was busy gawking at Meta and Mistral, DeepSeek quietly updated their coder model, and holy smokes, did they deliver! DeepSeek Coder v2 is now performing at GPT-4 and Claude 3.5 Sonnet levels on coding tasks. As Junyang Lin noted during our discussion, "DeepSeek Coder and DeepSeek Coder v2 should be the state of the art of the code-specific model."

Here are the results from BigCodeBench and from Aider Chat (code editing dashboard). But it's not just about raw performance. DeepSeek is bringing some serious innovation to the table.
They've added JSON mode, function calling, and even a fill-in-the-middle completion feature in beta. Plus, they've bumped up their max token generation to 8K. And let's talk about that API pricing – it's ridiculously cheap, at 14c / 1M tokens! We're talking about costs that are competitive with GPT-4o Mini, but with potentially better performance on coding tasks. It's a game-changer for developers and companies looking to integrate powerful coding AI without breaking the bank.

Google DeepMind's Math Wizardry: From Silver Medals to AI Prodigies

Just when we thought this week couldn't get any crazier, Google DeepMind decided to casually drop a bombshell that would make even the most decorated mathletes sweat. They've created an AI system that can solve International Mathematical Olympiad (IMO) problems at a silver medalist level. I mean, come on! As if the AI world wasn't moving fast enough, now we've got silicon-based Math Olympians?

This isn't just any run-of-the-mill calculator on steroids. We're talking about a combination of AlphaProof, a new breakthrough
Hey all, Alex here… well, not actually here, I'm scheduling this post in advance (which I haven't done before), because I'm going on vacation! That's right, next week is my birthday 🎉 and a much needed break, somewhere with a beach, is awaiting, but I didn't want to leave you hanging for too long, so I'm posting this episode with some amazing never-before-released material.

Mixture of Agents x2

Back in the far away days of June 20th (not that long ago, but it feels like ages!), Together AI announced a new paper, released code and posted a long post about a new method of collaboration between smaller models to beat larger models. They called it Mixture of Agents, and James Zou joined us to chat about that effort.

Shortly after that - in fact, during the live ThursdAI show - Kyle Corbitt announced that OpenPipe also researched an approach similar to the above, using different models and a bit of different reasoning, and also went after the coveted AlpacaEval benchmark, achieving a SOTA score of 68.8 using this method.

And I was delighted to invite both James and Kyle to chat about their respective approaches the same week that both broke the AlpacaEval SOTA, and hear how utilizing collaboration between LLMs can significantly improve their outputs!

This week's buzz - what I learned at W&B this week

So much buzz this week from the Weave team, it's hard to know what to put in here. I can start with the incredible integrations my team landed: Mistral AI, LLamaIndex, DSPy, OpenRouter and even local models served by Ollama, LMStudio and LLamaFile can now be auto tracked with Weave, which means you literally only have to instantiate Weave and it'll auto track everything for you!

But I think the biggest, hugest news from this week is the great eval comparison system that the Weave team just pushed. It's honestly so feature rich that I'll have to do a deeper dive on it later, but I wanted to make sure I include at least a few screencaps because I think it looks fantastic!
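Circling back to Mixture of Agents: the core loop is simple enough to sketch. A set of "proposer" models each draft an answer, optionally see each other's drafts for a round or two, and an aggregator model synthesizes the final response. The functions below are toy stand-ins for LLM calls, not Together AI's or OpenPipe's actual code:

```python
def mixture_of_agents(prompt, proposers, aggregator, rounds=2):
    """A simple layered Mixture-of-Agents loop.

    proposers: callables (prompt -> answer), stand-ins for LLM API calls.
    aggregator: callable (prompt, drafts) -> final answer.
    Each round after the first, proposers see the previous round's drafts
    appended to the prompt, so answers can refine each other.
    """
    drafts = []
    for _ in range(rounds):
        context = prompt if not drafts else (
            prompt + "\n\nPrevious answers:\n" + "\n".join(drafts))
        drafts = [propose(context) for propose in proposers]
    return aggregator(prompt, drafts)

# Toy stand-ins so the loop runs without any API keys:
proposers = [lambda p: "draft-1", lambda p: "draft-2"]
aggregator = lambda p, drafts: " | ".join(drafts)
print(mixture_of_agents("What is MoA?", proposers, aggregator))  # → draft-1 | draft-2
```

Swap the lambdas for real chat-completion calls to different open models and a strong aggregator model, and you have the skeleton of the approach both teams used to push AlpacaEval scores past single-model baselines.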
Open Router - A unified interface for LLMs

I've been a long time fan of OpenRouter.ai, and I was very happy to have Alex Atallah on the show to talk about Open Router (even if this did happen back in April!), and I'm finally satisfied enough with the sound quality to release this conversation.

Open Router serves both foundational models like GPT, Claude and Gemini as well as open source ones, and supports the OpenAI SDK format, making it super simple to play around and evaluate all of them with the same code. They even provide a few models for free! Right now you can use Phi, for example, completely free via their API.

Alex goes deep into the areas of Open Router that I honestly didn't really know about, like being a marketplace, knowing what trendy LLMs are being used by people in near real time (check out WebSim!) and more very interesting things! Give that conversation a listen, I'm sure you'll enjoy it!

That's it folks, no news this week. I would instead like to recommend a new newsletter by friends of the pod Tanishq Abraham and Aran Komatsuzaki, both of whom are doing a weekly paper X space and recently started posting it on Substack as well! It's called AI papers of the week, and if you're into papers which we don't usually cover, there's no better duo! In fact, Tanishq often used to come to ThursdAI to explain papers, so you may recognize his voice :)

See you all in two weeks after I get some seriously needed R&R 👋 😎🏖️

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
Hey everyone! Happy 4th of July to everyone who celebrates! I celebrated today by having an intimate conversation with 600 of my closest X friends 😂 Joking aside, today is a celebratory episode, the 52nd consecutive weekly ThursdAI show! I've been doing this as a podcast for a year now!

Which means there are some of you who've been subscribed for a year 😮 Thank you! Couldn't have done this without you. In the middle of my talk at AI Engineer (I still don't have the video!) I had to plug ThursdAI, and I asked the 300+ audience who is a listener of ThursdAI, and I saw a LOT of hands go up, which is honestly still quite humbling. So again, thank you for tuning in, listening, subscribing, learning together with me and sharing with your friends!

This week, we covered a new (soon to be) open source voice model from KyutAI, a LOT of open source LLMs from InternLM, Cognitive Computations (Eric Hartford joined us) and Arcee AI (Lukas Atkins joined as well), and we have a deep dive into GraphRAG with Emil Eifrem, CEO of Neo4j (who shares why it was called Neo4j in the first place, and that he's a ThursdAI listener, whaaat?
🤯), this is definitely a conversation you don't want to miss, so tune in, and read a breakdown below:

TL;DR of all topics covered:

* Voice & Audio
* KyutAI releases Moshi - first ever 7B end to end voice capable model (Try it)
* Open Source LLMs
* Microsoft updated Phi-3-mini - almost a new model
* InternLM 2.5 - best open source model under 12B on Hugging Face (HF, Github)
* Microsoft open sources GraphRAG (Announcement, Github, Paper)
* OpenAutoCoder-Agentless - SOTA on SWE Bench - 27.33% (Code, Paper)
* Arcee AI - Arcee Agent 7B - from Qwen2 - Function / Tool use finetune (HF)
* LMsys announces RouteLLM - a new Open Source LLM Router (Github)
* DeepSeek Chat got a significant upgrade (Announcement)
* Nomic GPT4all 3.0 - Local LLM (Download, Github)
* This week's Buzz
* New free Prompts course from WandB in 4 days (pre sign up)
* Big CO LLMs + APIs
* Perplexity announces their new pro research mode (Announcement)
* X is rolling out a "Grok Analysis" button, and it's BAD in "fun mode", and then paused the roll out
* Figma pauses the rollout of their AI text to design tool "Make Design" (X)
* Vision & Video
* Cognitive Computations drops DolphinVision-72b - VLM (HF)
* Chat with Emil Eifrem - CEO Neo4J about GraphRAG, AI Engineer

Voice & Audio

KyutAI Moshi - a 7B end to end voice model (Try It, See Announcement)

Seemingly out of nowhere, another French AI juggernaut decided to drop a major announcement. KyutAI, a company backed by Eric Schmidt, who called themselves "the first European private-initiative laboratory dedicated to open research in artificial intelligence" in a press release back in November of 2023, has quite a few rockstar co-founders, ex DeepMind and Meta AI, and has Yann LeCun on their science committee.

This week they showed their first, and honestly quite mind-blowing, release, called Moshi (Japanese for Hello, as in Moshi Moshi), which is an end to end voice and text model, similar to the GPT-4o demos we've seen, except this one is 7B parameters and can run on your mac!
While the utility of the model right now is not the greatest - not remotely close to anything resembling the amazing GPT-4o (which was demoed live to me and all of AI Engineer by Romain Huet) - Moshi shows very, very impressive stats! Built by a small team during only 6 months or so of work, they have trained an LLM (Helium 7B), an audio codec (Mimi), a Rust inference stack and a lot more, to give insane performance. Model latency is 160ms and mic-to-speakers latency is 200ms, which is so fast it seems like it's too fast. The demo often responds faster than I'm able to finish my sentence, and it results in an uncanny, "reading my thoughts" type feeling.

The most important part is this, though, a quote from KyutAI's post after the announcement:

Developing Moshi required significant contributions to audio codecs, multimodal LLMs, multimodal instruction-tuning and much more. We believe the main impact of the project will be sharing all Moshi's secrets with the upcoming paper and open-source of the model.

I'm really looking forward to how this tech can be applied to the incredible open source models we already have out there! Speaking to our LLMs is now officially here in the open source, way before we got GPT-4o, and it's exciting!

Open Source LLMs

Microsoft stealth updates Phi-3 Mini to make it almost a new model

So stealthy, in fact, that I didn't even have this update in my notes for the show, but thanks to the incredible community (Bartowski, Akshay Gautam) who made sure we didn't miss this, because it's so huge. The model used additional post-training data leading to substantial gains on instruction following and structure output.
We also improve multi-turn conversation quality, explicitly support <|system|> tag, and significantly improve reasoning capability

The Phi-3 June update is quite significant across the board. Just look at some of these scores: a 354.78% improvement in JSON structured output, and 30% on GPQA. For coding specifically, a 33→93 jump in Java, 33→73 in TypeScript, and 27→85 in Python! These are incredible numbers, and I definitely agree with Bartowski here: there's enough to call this a whole new model rather than a "seasonal update".

Qwen-2 is the star of the show right now

Week in and week out, ThursdAI seems to be the watercooler for the best finetuners in the community to come hang out, share notes, and announce their models. A month after Qwen-2 was announced live on the ThursdAI stage by friend of the pod and Qwen dev lead Junyang Lin, and a week after it re-took #1 on the revamped open LLM leaderboard on HuggingFace, we now have great finetunes on top of Qwen-2.

Qwen-2 is the star of the show right now. Because there's no better model. This is like GPT-4 level. It's open-weights GPT-4. We can do what we want with it, and it's so powerful, and it's multilingual, and it's everything. It's like the dream model. I love it.

Eric Hartford - Cognitive Computations

We had the authors of two Qwen-2-based finetunes on the show this week. First was Lukas Atkins from Arcee AI (the company behind MergeKit), who released Arcee Agent, a 7B Qwen-2 finetune/merge focused specifically on tool use and function calling. We also chatted with Eric Hartford from Cognitive Computations (which Lukas previously participated in) about Dolphin Vision, a 72B parameter VLM built on top of Qwen-2 (trained by StableQuan, available on the Hub), likely the biggest open source VLM we've seen so far. The most exciting part about it is Fernando Neta's "SexDrugsRockandroll" dataset, which supposedly contains, well..
a lot of uncensored stuff, and the model is perfectly able to discuss and analyze images with mature and controversial content.

InternLM 2.5 - SOTA open source under 12B with 1M context (HF, Github)

The folks at Shanghai AI Lab released InternLM 2.5 7B, plus a chat version, along with a whopping 1M-token context window extension. The metrics are ridiculous: it beats Llama-3 8B on literally every metric on the new HF leaderboard, and even beats Llama-3 70B on MATH while coming close on GPQA!

The Intern folks not only released a beast of a model, they also shipped significantly improved tool use capabilities, including their own agentic framework called Lagent, which comes with a code interpreter (Python execution), search capabilities, and of course the ability to plug in your own tools.

How will you serve 1M context in production, you ask? Well, they ALSO open sourced LMDeploy, "an efficient, user-friendly toolkit designed for compressing, deploying, and serving LLM models", which has been around for a while but now supports this new model of course, handling dynamic NTK scaling, context offloading and more. An incredible model + tools release; I can't wait to play around with it!

This week's Buzz (what I learned with WandB this week)

Hey, did you know we at Weights & Biases have free courses? While some folks ask you for a LOT of money for basic courses, at Weights & Biases they are... you guessed it, completely free! A lot of effort goes into recording and building the agenda, so I'm happy to announce that our "Developer's Guide to LLM Prompting" course launches in 4 days!
Delivered by my colleague Anish (who's an amazing educator) and Teodora from AutogenAI, you'll learn everything prompt-building related, and even if you're a seasoned prompting pro, there will be something there for you! Pre-register for the course HERE

Big CO LLMs + APIs

How I helped roll back an xAI feature, and Figma rolled back theirs

We've covered Grok (with a K this time) from xAI multiple times, and while I don't use its chat interface that much, or the open source model, I do think they have a huge advantage in direct access to real-time data from the X platform. Given that I basically live on X (to be able to deliver all these news to you), I started noticing the long-promised "Grok Analysis" button show up under some posts, first on mobile, then on the web version of X. Of course I had to test it, and whoa, I was honestly shocked at just how unhinged and profanity-laced the analysis was. Now, I'm not easily shocked: I've seen jailbroken LLMs before, and I've tried to get ChatGPT to say curse words multiple times, but it's one thing when you expect it and a completely different thing when a billion dollar company releases a product that answers... well, like this:

Luckily Igor Babushkin (co-founder of xAI) noticed, and the rollout was paused, so it looks like I helped red-team Grok! 🫡

Figma pauses AI "Make Design" feature

Another AI feature was paused by a big company after going viral on X (what is it about X specifically?), and this time it was Figma! In a supe
Hey everyone, sending a quick one today, no deep dive, as I'm still in the middle of AI Engineer World's Fair 2024 in San Francisco (in fact, I'm writing this from the incredible floor 32 presidential suite that the team here got for interviews, media and podcasting, and hey to all the new folks I've just met during the last two days!) It's been an incredible few days meeting so many ThursdAI community members, listeners and folks who came on the pod! The list is honestly too long, but I've got to meet friends of the pod Maxime Labonne, Wing Lian, Joao Moura (CrewAI), Vik from Moondream, and Stefania Druga, not to mention the countless folks who came up and gave high fives and introduced themselves. It was honestly a LOT of fun. (And it's still not over; if you're here, please come say hi, and let's take an LLM judge selfie together!)

On today's show, we recorded extra early because I had to run and play dress-up, and boy am I relieved now that both the show and the talk are behind me, and I can go and enjoy the rest of the conference 🔥 (which I will bring you here in full once I get the recording!)

On today's show, we had the awesome pleasure of having Surya Bhupatiraju, a research engineer at Google DeepMind, talk to us about their newly released, amazing Gemma 2 models! It was very technical, and a great conversation to check out! Gemma 2 came out in two sizes, 9B and 27B parameters, with 8K context (we addressed this on the show), and the 27B model's incredible performance beats Llama-3 70B on several benchmarks and even beats Nemotron 340B from NVIDIA! This model is now available to play with on Google AI Studio, and also on the Hub!

We also covered the renewal of the HuggingFace open LLM leaderboard, with new benchmarks in the mix and normalization of scores, and how Qwen 2 is again the best model that's been tested!
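On that score normalization: the rough idea is to rescale each benchmark so that random-chance accuracy maps to 0 and a perfect score to 100, which stops easy multiple-choice benchmarks from inflating the average. Here's a minimal sketch of that idea; this is my simplification, not HF's exact implementation:

```python
def normalize_score(raw, random_baseline):
    """Rescale a raw benchmark score (0-100) so that random-chance
    performance maps to 0 and a perfect score maps to 100.

    Simplified sketch of the normalization idea behind the HF
    Open LLM Leaderboard v2 refresh, not their exact code."""
    if raw <= random_baseline:
        return 0.0  # at or below random chance counts as zero signal
    return 100.0 * (raw - random_baseline) / (100.0 - random_baseline)

# A 4-choice benchmark has a 25% random baseline:
print(normalize_score(25.0, 25.0))   # -> 0.0
print(normalize_score(62.5, 25.0))   # -> 50.0
print(normalize_score(100.0, 25.0))  # -> 100.0
```

This is why the v2 averages look lower than v1's: scoring 26% on a 4-choice benchmark now counts for almost nothing instead of a free 26 points.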
It was a very insightful conversation, worth a listen if you're interested in benchmarks.

Last but not least, we had a conversation with Ethan Sutin, the co-founder of Bee Computer. At the AI Engineer speakers dinner, all the speakers received a wearable AI device as a gift, and I onboarded (because Swyx asked me) and kind of forgot about it. On the way back to my hotel I walked with a friend and chatted about my life. When I got back to my hotel, the app prompted me with "hey, I now know 7 new facts about you", and it was incredible to see how much of the conversation it was able to pick up, extracting facts and even TODOs! So I had to have Ethan on the show to dig a little into the privacy and use-cases of these hardware AI devices, and it was a great chat!

Sorry for the quick one today; if this is the first newsletter after you just met me and registered, there's usually a deeper dive here. Expect more in-depth write-ups in the next editions, as now I have to run down and enjoy the rest of the conference!

Here's the TL;DR and my raw show notes for the full show, in case it's helpful!

* AI Engineer is happening right now in SF
  * Tracks include Multimodality, Open Models, RAG & LLM Frameworks, Agents, AI Leadership, Evals & LLM Ops, CodeGen & Dev Tools, AI in the Fortune 500, GPUs & Inference
* Open Source LLMs
  * HuggingFace - LLM Leaderboard v2 (Blog)
    * Old benchmarks sucked and it's time to renew
    * New benchmarks:
      * MMLU-Pro (Massive Multitask Language Understanding - Pro version, paper)
      * GPQA (Google-Proof Q&A Benchmark, paper) - an extremely hard knowledge dataset
      * MuSR (Multistep Soft Reasoning, paper)
      * MATH (Mathematics Aptitude Test of Heuristics, Level 5 subset, paper)
      * IFEval (Instruction Following Evaluation, paper)
      * 🤝 BBH (Big Bench Hard, paper).
BBH is a subset of 23 challenging tasks from the BigBench dataset
    * The community will be able to vote for models, and models with the most votes get run first
  * Mozilla announces Builders Accelerator @ AI Engineer (X)
    * Theme: local AI
    * 100K non-dilutive funding
  * Google releases Gemma 2 (X, Blog)
* Big CO LLMs + APIs
  * UMG, Sony, Warner sue Udio and Suno for copyright (X)
    * They were able to recreate some songs
    * Suing both companies
    * 10 unnamed individuals are also on the suit
  * Google Chrome Canary has Gemini Nano (X)
    * Super easy to use: window.ai.createTextSession()
    * Nano 1 and 2, 4-bit quantized at 1.8B and 3.25B parameters, with decent performance relative to Gemini Pro
    * Behind a feature flag
    * Most text generation under 500ms
    * Unclear hardware requirements
    * Someone already built extensions
    * Someone already posted it on HuggingFace
  * Anthropic Claude shareable Projects (X)
    * Snapshots of Claude conversations shared with your team
    * Can share custom instructions
    * Anthropic has released the new "Projects" feature for Claude AI to enable collaboration and enhanced workflows
    * Projects allow users to ground Claude's outputs in their own internal knowledge and documents
    * Projects can be customized with instructions to tailor Claude's responses for specific tasks or perspectives
    * The "Artifacts" feature lets users see and interact with content generated by Claude alongside the conversation
    * Claude Team users can share their best conversations with Claude to inspire and uplevel the whole team
    * North Highland consultancy has seen 5x faster content creation and analysis using Claude
    * Anthropic is committed to user privacy and will not use shared data to train models without consent
    * Future plans include more integrations to bring in external knowledge sources for Claude
  * OpenAI voice mode update - not until fall
* AI Art & Diffusion & 3D
  * Fal open sourced AuraSR - a 600M upscaler based on GigaGAN (X, Fal)
* Interview with Ethan Sutin from Bee Computer
  * We all
got Bees as gifts
  * AI wearable that extracts TODOs, knows facts, etc.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
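One item from the notes above worth expanding on: the Chrome Canary `window.ai.createTextSession()` surface. Since the API is experimental and behind a feature flag, the exact surface may differ in your build; here's a minimal, feature-detecting sketch of the call pattern as reported (the function names are from the announcement, the wrapper is mine):

```javascript
// Sketch of the experimental window.ai prompt API reported in Chrome Canary.
// It is behind a feature flag and subject to change; this only illustrates
// the reported call pattern: createTextSession() then session.prompt().
async function askNano(promptText) {
  // Feature-detect so the code degrades gracefully outside Canary
  // or when the flag is not enabled.
  if (typeof window === "undefined" || !window.ai || !window.ai.createTextSession) {
    return null; // Gemini Nano not available here
  }
  const session = await window.ai.createTextSession();
  return session.prompt(promptText);
}
```

In a Canary build with the flag enabled, `await askNano("Summarize this tab")` should return an on-device Gemini Nano completion; everywhere else it returns null, which is the safe default for a feature this new.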