Environment Setup
Basic Requirements
- GPU: An NVIDIA data-center GPU such as the A100 or H100, or a supported AMD GPU. Note that the full 671B model requires multiple GPUs; the demo below spans two 8-GPU nodes.
- Memory: At least 32 GB of system RAM; more is better, especially during weight conversion.
- Operating System: Linux; Ubuntu 20.04 or later is recommended.
- Python: Python 3.8 or later (a quick environment check is sketched below).
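The following shell commands verify these requirements using standard Linux and NVIDIA tooling (on AMD systems, `rocm-smi` plays the role of `nvidia-smi`):

```bash
# List visible NVIDIA GPUs with driver and memory details
nvidia-smi

# Show total and available system RAM
free -h

# Print the OS release (Ubuntu reports its version here)
lsb_release -a

# Confirm the Python version is 3.8 or later
python3 --version
```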
Code and Model Preparation
- Clone the Official Repository
First, clone the DeepSeek V3 repository and install the inference dependencies:

```bash
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference
pip install -r requirements.txt
```
- Download Model Weights
- Download the official model weights from Hugging Face (the `deepseek-ai/DeepSeek-V3` repository).
- Place the weight files in the directory your inference setup expects; one way to download them is sketched below.
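A minimal download sketch, assuming the `huggingface-cli` tool from the `huggingface_hub` package and a local target directory of your choosing:

```bash
# Install the Hugging Face Hub CLI if it is not already present
pip install -U "huggingface_hub[cli]"

# Download the DeepSeek-V3 weights into a local directory
# (the checkpoint is very large; check your free disk space first)
huggingface-cli download deepseek-ai/DeepSeek-V3 --local-dir /path/to/DeepSeek-V3
```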
Deployment Options
1. DeepSeek-Infer Demo Deployment
This method is well suited to quick testing and experimentation:

```bash
# Convert the Hugging Face checkpoint into the demo's sharded format
python convert.py --hf-ckpt-path /path/to/DeepSeek-V3 \
    --save-path /path/to/DeepSeek-V3-Demo \
    --n-experts 256 \
    --model-parallel 16

# Launch interactive generation across 2 nodes with 8 GPUs each;
# set $RANK (this node's rank) and $ADDR (the master node's address)
# appropriately on each node
torchrun --nnodes 2 --nproc-per-node 8 generate.py \
    --node-rank $RANK \
    --master-addr $ADDR \
    --ckpt-path /path/to/DeepSeek-V3-Demo \
    --config configs/config_671B.json \
    --interactive \
    --temperature 0.7 \
    --max-new-tokens 200
```
2. SGLang Deployment (Recommended)
SGLang v0.4.1 is the best all-round choice for several reasons (a launch sketch follows this list):
- It supports MLA (Multi-head Latent Attention) optimization.
- It supports FP8 (W8A8) weights and activations.
- It supports an FP8 KV cache.
- It supports Torch Compile.
- It works with both NVIDIA and AMD GPUs.
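A minimal launch sketch, assuming SGLang is installed and eight GPUs on one node; flag names are current as of SGLang v0.4.x, so verify them against `python -m sglang.launch_server --help` for your version:

```bash
# Install SGLang with its serving dependencies
pip install "sglang[all]"

# Serve DeepSeek-V3 with tensor parallelism across 8 GPUs
python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --tp 8 \
    --trust-remote-code \
    --port 30000
```

The server then exposes an OpenAI-compatible HTTP API on the chosen port.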
3. LMDeploy Deployment (Recommended)
LMDeploy is well suited to large-scale deployments (a serving sketch follows this list), offering:
- Offline pipeline processing.
- Online service deployment.
- PyTorch workflow integration.
- Optimized inference performance.
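A minimal serving sketch, assuming LMDeploy is installed via pip; the tensor-parallel degree is a placeholder to match your GPU count:

```bash
# Install LMDeploy
pip install lmdeploy

# Start an OpenAI-compatible API server with tensor parallelism
lmdeploy serve api_server deepseek-ai/DeepSeek-V3 --tp 8
```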
4. TRT-LLM Deployment (Recommended)
TensorRT-LLM is a strong option (a build sketch follows this list) because it includes:
- BF16 and INT4/INT8 weight support.
- FP8 support (planned for an upcoming release).
- Optimized kernels that boost inference speed.
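TensorRT-LLM compiles an engine from a converted checkpoint before serving. The flow below is only a rough sketch: the checkpoint-conversion script for DeepSeek models lives in the TensorRT-LLM examples tree, and every path and flag here is a placeholder, so consult the TensorRT-LLM documentation for the exact recipe:

```bash
# 1. Convert the Hugging Face checkpoint to TensorRT-LLM format
#    (script location and options vary by model and release)
python convert_checkpoint.py \
    --model_dir /path/to/DeepSeek-V3 \
    --output_dir /path/to/trtllm-ckpt \
    --dtype bfloat16

# 2. Build the serving engine from the converted checkpoint
trtllm-build \
    --checkpoint_dir /path/to/trtllm-ckpt \
    --output_dir /path/to/trtllm-engine
```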
5. vLLM Deployment (Recommended)
vLLM v0.6.6 is a top pick (a launch sketch follows this list), offering:
- FP8 and BF16 mode support.
- Works with NVIDIA and AMD GPUs.
- Pipeline parallelism capability.
- Multi-machine distributed deployment.
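A minimal serving sketch, assuming vLLM v0.6.6 or later installed via pip; set `--tensor-parallel-size` to your GPU count:

```bash
# Install vLLM
pip install vllm

# Serve DeepSeek-V3 with tensor parallelism across 8 GPUs
vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --trust-remote-code
```

For multi-node deployments, combine tensor parallelism within each node with pipeline parallelism across nodes (e.g., add `--pipeline-parallel-size 2` for two nodes).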
Performance Optimization Tips
Memory Optimization
- Use FP8 or INT8 quantization to reduce memory usage.
- Enable KV cache optimization (e.g., an FP8 KV cache where the backend supports it).
- Set batch sizes appropriate to your GPU memory. The flags sketched below illustrate these knobs.
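As an illustration, these vLLM serving flags control quantization, KV-cache precision, and batching; the flag names are current as of v0.6.x, and the values are placeholders to tune:

```bash
# --quantization fp8         : FP8 weight/activation quantization
# --kv-cache-dtype fp8       : store the KV cache in FP8
# --gpu-memory-utilization   : fraction of GPU memory vLLM may claim
# --max-num-seqs             : cap on concurrently batched sequences
vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 64
```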
Speed Optimization
- Enable Torch Compile (an SGLang example follows this list).
- Use pipeline parallelism to spread the model across nodes.
- Optimize input/output processing, e.g., batch requests and stream responses.
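For example, SGLang exposes a Torch Compile switch at launch time (flag name as of v0.4.x; check `--help` for your version):

```bash
# Launch SGLang with torch.compile enabled for the model forward pass
python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --tp 8 \
    --trust-remote-code \
    --enable-torch-compile
```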
Stability Optimization
- Implement error handling mechanisms.
- Add monitoring and logging (a simple health-check loop is sketched below).
- Regularly check system resources.
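A minimal monitoring sketch, assuming your serving stack exposes an HTTP `/health` endpoint on port 8000 (both vLLM and SGLang provide one; adjust the URL and port for your setup):

```bash
# Poll the server's health endpoint and log GPU memory every 60 seconds
while true; do
    # A failed or timed-out request is appended to the log for alerting
    curl -sf --max-time 5 http://localhost:8000/health \
        || echo "$(date): health check failed" >> server_health.log

    # Record per-GPU memory usage for trend analysis
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader \
        >> gpu_memory.log

    sleep 60
done
```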
Common Issues and Solutions
Memory Issues
- Reduce batch size.
- Use lower precision (e.g., FP8 or INT8).
- Enable memory optimization options.
Performance Issues
- Check GPU utilization.
- Optimize model configuration.
- Adjust parallel strategies.
Deployment Errors
- Verify environment dependencies (a quick check is sketched below).
- Check that the model weights are complete and in the expected location.
- Review detailed logs for troubleshooting.
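A quick dependency check using standard pip and PyTorch calls:

```bash
# Report broken or conflicting package requirements
pip check

# Confirm PyTorch loads and can see the GPUs
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
```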
Next Steps
Once the basic deployment is complete, you can:
- Conduct performance benchmarking.
- Optimize configuration parameters.
- Integrate the model with existing systems.
- Develop custom features tailored to your needs.
By following this guide, you’ve learned the key methods for deploying DeepSeek V3 locally. Choose the deployment strategy that aligns with your requirements and start building powerful AI applications!