Inference | Tags | Blog

Paper Collection

Tech Miscellany

Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models

ZipLM: Inference-Aware Structured Pruning of Language Models

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Fast Inference from Transformers via Speculative Decoding

Blockwise Parallel Decoding

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

Hello! I'm

Dylan

Site launched 2023.10.08