TwelveLabs raises $100M Series B to build video understanding for enterprise AI agents
TwelveLabs has closed a $100 million Series B round co-led by New Enterprise Associates (NEA) and NAVER Ventures, bringing cumulative funding to over $200 million. Investors include Amazon, Radical Ventures, Korea Investment Partners, Index Ventures, Quadrille Capital, and Red Bull Ventures. Amazon Web Services (AWS) also becomes TwelveLabs' preferred cloud partner under a multiyear commitment: new models will launch on AWS first and be optimized for AWS Trainium AI chips, expanding Amazon's non-NVIDIA video AI capabilities.
TwelveLabs develops two models: Marengo 3.0, a video embedding model that converts raw footage (speech, sound, motion) into searchable semantic representations at scale, and Pegasus 1.5, a video language model that can reason over up to two hours of continuous context. Unlike generative video tools (Sora, Veo, Runway), TwelveLabs indexes and queries existing video. That addresses an enterprise pain point: billions of hours of video archives (surveillance, broadcasts, sports, factory footage, medical records) remain opaque to AI systems because current LLMs only sample isolated frames. The company reports 178 employees, up from 58 a year ago, and operates from San Francisco and Seoul.
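To make "searchable semantic representations" concrete, the sketch below shows the general idea behind embedding-based video search: each clip is stored as a vector, a query is turned into a vector in the same space, and clips are ranked by cosine similarity. This is an illustration of the general technique only, not TwelveLabs' API or Marengo's actual output; the clip names, the three-dimensional vectors, and the `search_clips` helper are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search_clips(query_vec, index, top_k=2):
    """Rank indexed clips by similarity to the query vector (hypothetical helper)."""
    scored = [
        (clip_id, cosine_similarity(query_vec, vec))
        for clip_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up 3-dimensional embeddings; real video embeddings have far more
# dimensions and would come from an embedding model, not be typed by hand.
clip_index = {
    "match_goal_00:41:10": [0.9, 0.1, 0.0],
    "factory_line_02:13:55": [0.1, 0.9, 0.2],
    "newsroom_00:05:30": [0.0, 0.2, 0.9],
}

# Stand-in for the embedding of a text query such as "player scores a goal".
query = [0.8, 0.2, 0.1]

results = search_clips(query, clip_index)
for clip_id, score in results:
    print(f"{clip_id}: {score:.3f}")
```

At archive scale the linear scan would typically be replaced by an approximate nearest-neighbor index, but the ranking principle is the same.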
The $100M raise signals investor conviction in video understanding as a category distinct from video generation. Venture funding for vertical AI has become increasingly selective, with average deal size doubling year-over-year even as deal count falls. That TwelveLabs secured a strategic commitment from Amazon that goes beyond equity suggests enterprise demand for queryable video archives is real and growing across media, compliance, security, and sports workflows.
For infrastructure and ML teams, the significance of the deal lies less in its size than in its structure: under the Trainium commitment, TwelveLabs' models launch on AWS first and are optimized for Amazon's chips, which gives AWS native support for video reasoning workloads. As agents and autonomous systems move into roles that require perceiving and reasoning about physical reality, video becomes the modality that matters most, and teams that can extract semantic meaning from recorded footage become increasingly critical to production AI stacks.