AWS Lambda terminates containers with an uncatchable SIGKILL signal when a function exceeds its configured timeout. When a Spark-on-AWS-Lambda (SoAL) job is killed between Phase 1 (data upload) and Phase 2 (metadata commit) of a write, the result is silent data loss: orphaned Parquet files accumulate on S3 while the table's committed state remains unchanged and standard monitoring raises no alert. We characterize this vulnerability across Delta Lake and Apache Iceberg through 860 controlled kill-injection experiments at three dataset sizes. A SIGKILL landing in the inter-phase gap produced silent data loss in 100% of trials for both formats. We then present SafeWriter, a language-level wrapper that arms a watchdog thread 30 seconds before the Lambda timeout, triggers a format-native rollback via SQL, and records a checkpoint document on S3. SafeWriter converted every tested kill scenario into a clean, detectable rollback with under 100 ms added to normal write paths.
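The watchdog mechanism described above can be illustrated with a minimal sketch. This is not the SafeWriter implementation: the class `GuardedWrite`, the helper `watchdog_delay_s`, and the `upload`, `commit`, `rollback`, and `record_checkpoint` callables are hypothetical stand-ins for the Phase 1 upload, the Phase 2 metadata commit, the format-native SQL rollback, and the S3 checkpoint write. The sketch assumes the remaining execution time is known in milliseconds and that a timer thread fires 30 seconds before the deadline.

```python
import threading
from typing import Callable, Dict

SAFETY_MARGIN_S = 30.0  # margin taken from the abstract: fire 30 s before timeout


def watchdog_delay_s(remaining_ms: int, margin_s: float = SAFETY_MARGIN_S) -> float:
    """Seconds to wait before the watchdog fires (never negative)."""
    return max(0.0, remaining_ms / 1000.0 - margin_s)


class GuardedWrite:
    """Hypothetical two-phase write guarded by a watchdog timer.

    State moves from "pending" to either "committed" or "rolled_back".
    A lock ensures the commit and the rollback never both run.
    """

    def __init__(
        self,
        remaining_ms: int,
        rollback: Callable[[], None],
        record_checkpoint: Callable[[Dict[str, str]], None],
        margin_s: float = SAFETY_MARGIN_S,
    ) -> None:
        self._delay = watchdog_delay_s(remaining_ms, margin_s)
        self._rollback = rollback
        self._record_checkpoint = record_checkpoint
        self._lock = threading.Lock()
        self.state = "pending"

    def _on_deadline(self) -> None:
        # Runs on the timer thread when the safety margin is reached.
        with self._lock:
            if self.state != "pending":
                return  # the commit already finished; nothing to undo
            self.state = "rolled_back"
            self._rollback()
            self._record_checkpoint({"status": "rolled_back"})

    def run(self, upload: Callable[[], None], commit: Callable[[], None]) -> str:
        timer = threading.Timer(self._delay, self._on_deadline)
        timer.daemon = True
        timer.start()
        try:
            upload()  # Phase 1: data files are written
            with self._lock:
                if self.state == "rolled_back":
                    return self.state  # watchdog won the race; skip the commit
                commit()  # Phase 2: metadata commit
                self.state = "committed"
        finally:
            timer.cancel()
        return self.state
```

In a Lambda handler, `remaining_ms` could come from `context.get_remaining_time_in_millis()`. The sketch simplifies in two ways: it does not interrupt an upload that is still running when the watchdog fires, and a watchdog that fires during the commit simply waits for the commit to finish.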