Adding a New Quantization Strategy to llm-compressor -- the Hacky Shortcut
Published: 2026/5/24 1:33:58
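For the conventional procedure, see the previous article, "Custom INT4 Block Quantization: From llm-compressor to vLLM, a Complete Walkthrough" (CSDN blog). This article covers a different route for adding a quantization strategy: unconventional, but remarkably simple.

As the previous article showed, adding a custom quantization strategy normally means subclassing Modifier and implementing the full flow yourself, including the scale computation. Broadly speaking, though, quantization comes in only a handful of flavors: group, block, per-channel, per-token, and per-tensor. llm-compressor already supports all of them; they are just defined in a separate dependency, compressed-tensors:

    compressed-tensors\src\compressed_tensors\quantization\quant_scheme.py

The file is too long to paste in full, so here are a few examples:

    # 4 bit integer weights only quantization
    W4A16 = dict(
        weights=QuantizationArgs(
            num_bits=4,
            type=QuantizationType.INT,
            strategy=QuantizationStrategy.GROUP,
            group_size=128,
            symmetric=True,
            dynamic=False,
        ),
    )

    # 4 bit integer weights only asymmetric quantization
    W4A16_ASYM = dict(
        weights=QuantizationArgs(
            num_bits=4,
            type=QuantizationType.INT,
            strategy=QuantizationStrategy.GROUP,
            group_size=128,
            symmetric=False,
            dynamic=False,
        ),
    )

    FP8_BLOCK = dict(
        weights=QuantizationArgs(
            num_bits=8,
            type=QuantizationType.FLOAT,
            strategy=QuantizationStrategy.BLOCK,
            symmetric=True,
            dynamic=False,
            block_structure=[128, 128],
        ),
        input_activations=QuantizationArgs(
            num_bits=8,
            type=QuantizationType.FLOAT,
            strategy=QuantizationStrategy.GROUP,
            symmetric=True,
            dynamic=True,
            group_size=128,
        ),
    )

So what if we simply assemble a new QuantizationArgs ourselves? Will that work?

    QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.GROUP,
        group_size=128,
        symmetric=True,
        dynamic=False,
    )

The answer: it really does. The rest of this article walks through how to use this approach to implement the int4_block quantization from the previous article.

1. Define a new QuantizationArgs

There are two ways to do this; pick whichever you prefer.

Method 1: copy every field of W4A16 and change only the grouping strategy.

    from compressed_tensors.quantization import (
        QuantizationArgs,
        QuantizationStrategy,
        QuantizationType,
    )
    # W4A16 is the preset dict defined in quant_scheme.py (shown above)
    from compressed_tensors.quantization.quant_scheme import W4A16

    W4A16_BLOCK = dict(
        weights=QuantizationArgs(
            **{
                **W4A16["weights"].model_dump(),          # inherit every field of W4A16
                "strategy": QuantizationStrategy.BLOCK,
                "group_size": None,                       # clear the group setting
                "block_structure": [16, 16],              # add the block setting
            },
        ),
    )

Method 2: write the new configuration out directly (clearer).

    from compressed_tensors.quantization import (
        QuantizationArgs,
        QuantizationStrategy,
        QuantizationType,
    )

    W4A16_BLOCK = dict(
        weights=QuantizationArgs(
            num_bits=4,
            type=QuantizationType.INT,
            strategy=QuantizationStrategy.BLOCK,
            block_structure=[16, 16],
            symmetric=True,
            dynamic=False,
        ),
    )

This is much simpler, and it avoids the whole business of subclassing Modifier.

2. Register the new scheme

This is the most important step: register the custom scheme in quant_scheme under a string name.

    from compressed_tensors.quantization import quant_scheme

    if "W4A16_BLOCK" not in quant_scheme.PRESET_SCHEMES:
        quant_scheme.PRESET_SCHEMES["W4A16_BLOCK"] = W4A16_BLOCK
        print("[register_block_scheme] W4A16_BLOCK registered")

Complete code, register_custom_scheme.py:

    """Import this module to register W4A16_BLOCK preset scheme."""
    from compressed_tensors.quantization import (
        QuantizationArgs,
        QuantizationStrategy,
        QuantizationType,
    )
    from compressed_tensors.quantization import quant_scheme

    W4A16_BLOCK = dict(
        weights=QuantizationArgs(
            num_bits=4,
            type=QuantizationType.INT,
            strategy=QuantizationStrategy.BLOCK,
            block_structure=[16, 16],
            symmetric=True,
            dynamic=False,
        ),
    )

    # Register only once, so importing the module repeatedly is harmless
    if "W4A16_BLOCK" not in quant_scheme.PRESET_SCHEMES:
        quant_scheme.PRESET_SCHEMES["W4A16_BLOCK"] = W4A16_BLOCK
        print("[register_block_scheme] W4A16_BLOCK registered")

At quantization time, just import register_custom_scheme.py so the registration runs, and the scheme can then be used by name. The import name must match the file name:

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    import register_custom_scheme  # register custom quant scheme

    # `model` is assumed to be loaded already
    recipe = QuantizationModifier(targets="Linear", scheme="W4A16_BLOCK", ignore=["lm_head"])
    oneshot(model=model, recipe=recipe, pipeline="datafree")

Much simpler, isn't it? Note, however, that this only composes the quantization scheme on the llm-compressor side. To run inference in vLLM, you still need to add the matching scheme dispatch route on the vLLM side; see the previous article, "Custom INT4 Block Quantization: From llm-compressor to vLLM, a Complete Walkthrough" (CSDN blog).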
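As a rough illustration of what a BLOCK strategy with block_structure [16, 16] asks for, the sketch below quantizes a plain Python matrix with one scale per 16x16 tile. It is a pure-Python toy, not code from compressed-tensors: the function names, the max|w| / 7 scale rule, and the clamp to [-8, 7] are assumptions made for illustration, and the library's own observers may compute scales differently.

```python
import math


def quantize_int4_blockwise(weight, block_rows=16, block_cols=16):
    """Symmetric INT4 quantization with one scale per block (toy version).

    `weight` is a 2-D list of floats. Returns (quantized ints, scale grid).
    Assumption: scale = max|w| / 7 per block, values clamped to [-8, 7].
    """
    rows, cols = len(weight), len(weight[0])
    n_block_rows = math.ceil(rows / block_rows)
    n_block_cols = math.ceil(cols / block_cols)
    scales = [[1.0] * n_block_cols for _ in range(n_block_rows)]
    quantized = [[0] * cols for _ in range(rows)]

    for bi in range(n_block_rows):
        for bj in range(n_block_cols):
            # Index range covered by this block (edge blocks may be smaller)
            r0, r1 = bi * block_rows, min((bi + 1) * block_rows, rows)
            c0, c1 = bj * block_cols, min((bj + 1) * block_cols, cols)
            max_abs = max(
                abs(weight[r][c]) for r in range(r0, r1) for c in range(c0, c1)
            )
            # Avoid a zero scale for an all-zero block
            scale = max_abs / 7.0 if max_abs > 0 else 1.0
            scales[bi][bj] = scale
            for r in range(r0, r1):
                for c in range(c0, c1):
                    q = round(weight[r][c] / scale)
                    quantized[r][c] = max(-8, min(7, q))
    return quantized, scales


def dequantize_blockwise(quantized, scales, block_rows=16, block_cols=16):
    """Map INT4 values back to floats using each element's block scale."""
    return [
        [q * scales[r // block_rows][c // block_cols] for c, q in enumerate(row)]
        for r, row in enumerate(quantized)
    ]


if __name__ == "__main__":
    # A 32x32 weight splits into a 2x2 grid of 16x16 blocks
    demo = [[((r * 31 + c * 17) % 23 - 11) / 10.0 for c in range(32)] for r in range(32)]
    q, s = quantize_int4_blockwise(demo)
    print(len(s), len(s[0]))
```

Compared with the GROUP strategy, where each row is cut into 1-D groups of group_size elements, the only structural change here is that the scale grid is indexed by both row and column block.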