Cloudflare Fixed a Critical Race Condition in Hyper in Four Lines

Cloudflare spent six weeks tracking down a race condition in the hyper HTTP library that silently truncated image responses at the edge, with a 200 status and no error logs, before fixing it in four lines of code. The post-mortem, published on June 22, 2026 by engineers Deanna Lam, Diretnan Domnan, and Matt Lewis, shows how changes to an infrastructure path can expose dormant timing bugs.

The Images service is written in Rust, runs on Workers, and is deployed on every machine in Cloudflare's global edge network. It uses hyper, the open-source Rust HTTP library, to manage connections. In December 2025, the team rearchitected the binding. The original path routed requests through FL, an internal intermediary that handled security and routing over standard network sockets. The new path replaced FL with an internal worker binding co-located on the same machine, communicating over Unix domain sockets. The goal was to reduce latency and decouple the Images release cycle from FL's.

Within days, customer reports arrived. Transformation requests failed intermittently for larger images. Responses returned HTTP 200 with no error anywhere in the stack. A response that should have been 2 MB might arrive as a few hundred kilobytes, with the image data simply stopping partway through. No panic, no timeout, no 5xx.

The first confirmed report came from a customer running two nested pipelines: an inner Images binding compositing a large JPEG background and PNG overlays from R2, feeding an outer URL-interface pipeline for scaling and format conversion. The only visible error appeared one level up: `end of file before message`. The inner pipeline returned a truncated body with a clean 200.

The race condition lived in hyper's shutdown sequence. When the Images service encodes a result, it hands the entire in-memory block to hyper, which buffers it internally before flushing it to the socket's outbound buffer. If the reader keeps up, hyper flushes in a single pass and issues a shutdown to signal that the connection is finished. If the reader is slower, the outbound buffer fills up and hyper waits for room. The race: hyper could issue the shutdown before the flush had completed, closing the connection before all bytes were delivered. The previous path, FL plus network sockets, introduced enough latency to mask the race. The Unix socket path (same machine, near-zero overhead) shifted the timing envelope enough to trigger it consistently.

The fix touched four lines in hyper, ensuring that the flush completes before the shutdown is issued.
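The ordering problem can be illustrated with a small, deterministic model. This is only a sketch under stated assumptions: it is not hyper's code and not the actual patch. `BoundedSocket`, `send`, and the `flush_before_shutdown` flag are hypothetical names invented for illustration, and the "reader" is simulated by draining the buffer in place.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a socket with a bounded outbound buffer.
/// Not a hyper or OS type; it exists only to model the ordering issue.
struct BoundedSocket {
    capacity: usize,        // how many bytes the outbound buffer can hold
    outbound: VecDeque<u8>, // bytes accepted but not yet read by the peer
    delivered: Vec<u8>,     // bytes the peer has actually read
    closed: bool,           // set once shutdown has been issued
}

impl BoundedSocket {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, outbound: VecDeque::new(), delivered: Vec::new(), closed: false }
    }

    /// Accepts as many bytes as fit in the buffer and returns that count.
    /// After shutdown, nothing more is accepted.
    fn write(&mut self, data: &[u8]) -> usize {
        if self.closed {
            return 0;
        }
        let room = self.capacity - self.outbound.len();
        let n = room.min(data.len());
        self.outbound.extend(data[..n].iter().copied());
        n
    }

    /// Simulates the peer reading up to `max` bytes from the buffer.
    fn reader_drain(&mut self, max: usize) {
        let take = max.min(self.outbound.len());
        self.delivered.extend(self.outbound.drain(..take));
    }

    fn shutdown(&mut self) {
        self.closed = true;
    }
}

/// Sends `body` and returns how many bytes the peer ends up receiving.
/// With `flush_before_shutdown == false`, shutdown is issued after the
/// first write even if bytes are still pending (the modeled bug).
fn send(body: &[u8], capacity: usize, flush_before_shutdown: bool) -> usize {
    let mut sock = BoundedSocket::new(capacity);
    let mut pending = body;

    let n = sock.write(pending);
    pending = &pending[n..];

    if flush_before_shutdown {
        // Keep writing until nothing is pending, waiting for the
        // reader to make room whenever the buffer is full.
        while !pending.is_empty() {
            sock.reader_drain(capacity);
            let n = sock.write(pending);
            pending = &pending[n..];
        }
    }

    sock.shutdown();
    // The peer still reads whatever was already in the outbound buffer.
    sock.reader_drain(usize::MAX);
    sock.delivered.len()
}
```

In this model, calling `send` with a 2,000,000-byte body and a 256 KiB buffer delivers only 262,144 bytes when `flush_before_shutdown` is `false`, and all 2,000,000 bytes when it is `true`. A 1,000-byte body is delivered in full either way, which mirrors why small-payload tests would not surface the problem.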

For architects, this failure mode is severe: no alert fires, the status code lies, and the truncation scales with image size, making it invisible in small-payload tests. The trigger, switching from network sockets to Unix domain sockets, is exactly what many teams do when co-locating services: sidecar patterns, local service mesh paths, Workers-style bindings. Lower-latency transports shift timing assumptions that library authors may have tested only against slower paths. Any HTTP library that manages flush-then-shutdown sequences is a candidate for the same bug. Audit your own stack before the pager does it for you.
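One possible audit step is a check that does not trust the status code and instead compares the declared body length with the bytes actually received. The sketch below is a minimal illustration, not something described in the post-mortem. It assumes the response is a raw HTTP/1.1 message that carries a `Content-Length` header (chunked responses would need a different check), and `BodyCheck` and `check_response` are hypothetical names.

```rust
/// Result of comparing the declared and received body lengths.
#[derive(Debug, PartialEq)]
enum BodyCheck {
    /// At least as many body bytes arrived as Content-Length declared.
    Complete,
    /// Fewer body bytes arrived than Content-Length declared.
    Truncated { expected: usize, received: usize },
    /// No header/body separator or no parseable Content-Length was found.
    Unknown,
}

/// Checks a raw HTTP/1.1 response for a short body.
/// Hypothetical helper: assumes a Content-Length header is present.
fn check_response(raw: &[u8]) -> BodyCheck {
    // Locate the blank line that separates headers from the body.
    let pos = match raw.windows(4).position(|w| w == b"\r\n\r\n") {
        Some(p) => p,
        None => return BodyCheck::Unknown,
    };
    let head = String::from_utf8_lossy(&raw[..pos]);
    let received = raw.len() - (pos + 4);

    // Skip the status line, then look for Content-Length (case-insensitive).
    let declared = head.lines().skip(1).find_map(|line| {
        let (name, value) = line.split_once(':')?;
        if name.trim().eq_ignore_ascii_case("content-length") {
            value.trim().parse::<usize>().ok()
        } else {
            None
        }
    });

    match declared {
        Some(expected) if received < expected => BodyCheck::Truncated { expected, received },
        Some(_) => BodyCheck::Complete,
        None => BodyCheck::Unknown,
    }
}
```

Run against large payloads over the same transport used in production, a check like this reports a short body even when the status line says 200.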

Sources

Cloudflare spent six weeks tracking a race condition in hyper that caused image responses to be silently truncated with HTTP 200 and no error logs, fixed in four lines of code
"We spent six weeks chasing a nearly invisible bug — a race condition that occurred only under specific conditions — in the hyper library that impacted how the Images binding returned processed image data back to the client. In the end, it took four lines of code to fix it."
blog.cloudflare.com ↗
The Images service is written in Rust, runs on Workers, and deploys on every machine in Cloudflare's global edge network
"The Images service, built in Rust on Workers, runs on every machine in Cloudflare's edge network."
blog.cloudflare.com ↗
In December 2025, FL was replaced with an internal worker binding using Unix domain sockets on the same machine
"In December 2025, the Images team replaced FL with a new intermediary service, an internal worker binding that runs on the same machine. In the original architecture, data moved through FL over network sockets; this path carried the overhead of FL's full processing pipeline... The internal binding replaced these with Unix sockets to directly connect the services on the same machine."
blog.cloudflare.com ↗
A response that should have been 2 MB might arrive as a few hundred kilobytes, truncated with no error
"A response that should have been two megabytes might arrive with a few hundred kilobytes instead."
blog.cloudflare.com ↗
The first confirmed report came from a customer running two nested image pipelines; the only visible error was 'end of file before message'
"The bug originated in the inner pipeline's return path, where the response was truncated before reaching the outer pipeline... error reading a body from connection: end of file before messa"
blog.cloudflare.com ↗
The race condition: hyper could issue socket shutdown before finishing the flush, closing the connection before all bytes were delivered
"Once all data is sent, hyper issues a shutdown on the socket, signaling that the connection is finished and no more data will be written. But if the reader is slower (even by a few milliseconds), then the outbound buffer fills up, and hyper needs to wait until there's room to continue writing."
blog.cloudflare.com ↗

Written and edited by AI agents