A Collection of PyTorch Problems

Posted by HieDean on April 3, 2020

RuntimeError: running_mean should contain 12 elements not 3084

Check whether the channels argument passed to nn.BatchNorm2d(channels) matches the number of channels of its input.
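
For reference, the argument to nn.BatchNorm2d has to equal the channel dimension (dim 1) of the input it receives; a minimal sketch with made-up shapes:

import torch
import torch.nn as nn

x = torch.randn(8, 12, 32, 32)   # (batch, channels, height, width)
bn = nn.BatchNorm2d(12)          # must equal x.shape[1]; nn.BatchNorm2d(3084) would raise the error above
y = bn(x)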

Warning: NaN or Inf found in input tensor appears during training

This is usually attributed to exploding or vanishing gradients, i.e. gradients that are too large or too small. Check whether any of the model's hyperparameters are set unreasonably (especially learning_rate), adjust them, and try again. In my case, though, it turned out the model itself was written incorrectly, so before training, carefully check that your model and loss function are correct; don't be lazy.
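
To locate where the NaN/Inf first shows up, one option is anomaly detection plus a finiteness check on the loss. A minimal sketch, where the model, optimizer, loss and data are throwaway stand-ins for your own objects:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                  # stand-ins for your real model / data
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
loader = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(3)]

torch.autograd.set_detect_anomaly(True)                   # slower, but reports the op that produced NaN/Inf

for step, (x, y) in enumerate(loader):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    if not torch.isfinite(loss):                          # catch the bad batch before it poisons the weights
        print(f"non-finite loss {loss.item()} at step {step}")
        break
    loss.backward()
    optimizer.step()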

torch.cuda.is_available() returns False

I switched servers and rebuilt the environment, and was briefly baffled when torch.cuda.is_available() returned False afterwards. Don't overthink it: it is almost certainly a version problem, so go check the versions.

Check the GPU driver version

nvidia-smi

Check the CUDA version

nvcc -V
cat /usr/local/cuda/version.txt

Check the cuDNN version

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

Check the PyTorch version
python 
>>> import torch
>>> torch.__version__
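
The same versions can also be read from inside a script; torch.version.cuda and torch.backends.cudnn.version() report what PyTorch itself was built against:

import torch

print(torch.__version__)               # PyTorch version
print(torch.version.cuda)              # CUDA version PyTorch was built with
print(torch.backends.cudnn.version())  # cuDNN version
print(torch.cuda.is_available())       # should become True once driver / CUDA / PyTorch versions match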

Running a model on the GPU fails with RuntimeError: CUDA out of memory. Tried to allocate 38.00 MiB (GPU 0; 10.76 GiB total capacity; 9.71 GiB already allocated; 5.56 MiB free; 9.82 GiB reserved in total by PyTorch)

  • There are generally three causes:
    1. Another process is occupying GPU memory, so this process cannot allocate enough.
    2. Too much memory is held in PyTorch's cache; call torch.cuda.empty_cache() to release it (see the sketch after this list).
    3. The card itself is too small; switch to a GPU with more memory.
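
A minimal sketch of point 2: drop references you no longer need, then release PyTorch's cached blocks (x here is just a throwaway tensor standing in for stale intermediates):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)   # throwaway tensor standing in for stale intermediates
del x                                         # drop the reference so the memory becomes reclaimable
torch.cuda.empty_cache()                      # return cached, unused blocks to the driver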

Besides that, note that at test time in PyTorch you must wrap the evaluation code in

with torch.no_grad():
    # run the test / inference code here

Otherwise gradient bookkeeping roughly doubles GPU memory usage and leads to the OOM error.

invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number

Following https://github.com/NVIDIA/flownet2-pytorch/issues/113, change every xxx.data[0] to xxx.data, or simply use xxx.item() as the error message suggests.
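
A before/after sketch, with a 0-dim tensor standing in for a real loss value:

import torch

loss = torch.tensor(0.5)    # stand-in for a 0-dim loss tensor
# PyTorch 0.3-era style:  value = loss.data[0]   -> raises the error above
value = loss.item()         # works for 0-dim tensors in PyTorch >= 0.4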

After wrapping the model in DataParallel, this warning appears: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters(). The fix is to call flatten_parameters() at the top of forward:

def forward(self, x):
    self.LSTM.flatten_parameters()   # re-compact the RNN weights before using them
    ...
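
A fuller sketch of where the call goes, assuming a hypothetical module whose LSTM is wrapped in DataParallel; flatten_parameters() at the start of forward re-compacts the weights on every replica:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.LSTM = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)

    def forward(self, x):
        self.LSTM.flatten_parameters()   # re-compact RNN weights on this replica
        out, _ = self.LSTM(x)
        return out

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.DataParallel(Net()).to(device)
y = model(torch.randn(4, 10, 64, device=device))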

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed

Call testTensor.detach() on the tensor whose gradient would otherwise be backpropagated through again, turning it into a leaf node; a minimal example follows.
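
A common case is carrying a state tensor across iterations; a small sketch (all names are made up) showing where detach() goes:

import torch

w = torch.randn(1, requires_grad=True)
state = torch.zeros(1)

for step in range(3):
    state = (state * w).sum()   # each step builds a new graph on top of the previous state
    loss = state ** 2
    loss.backward()             # frees the graph that produced `state`
    state = state.detach()      # without this, the next backward() would walk into the
                                # already-freed graph and raise the error above
    w.grad = None               # clear the accumulated gradient for the next step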

Print the total number of model parameters

total = sum([param.nelement() for param in model.parameters()])
print("Number of parameter: %.2fM" % (total/1e6))

Getting the value of a tensor in PyTorch

testTensor.item()

numpy raises Cannot load file containing pickled data when loading a .npy file

data = numpy.load('xxx.npy', encoding='bytes', allow_pickle=True)

Python runtime error: RuntimeError: CUDA error: invalid device ordinal

This usually happens because the specified GPU index exceeds the number of GPUs actually available; check the code that selects the GPU.
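
For example, on a machine with a single GPU only ordinal 0 exists; a small sketch for picking a valid device:

import torch

n_gpus = torch.cuda.device_count()                  # GPUs actually visible to this process
device = torch.device("cuda:0" if n_gpus > 0 else "cpu")
# torch.device("cuda:3") on a single-GPU machine triggers "invalid device ordinal" on use
model = torch.nn.Linear(4, 2).to(device)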

ImportError: bad magic number in : b'\x03\xf3\r\n'

Delete all .pyc files under the project from the command line:

find . -name "*.pyc" -exec rm -f {} \;

Choosing a device in torch.load()

device = torch.device('cpu')
checkpoint = torch.load('xxx.pth', map_location=device)
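
A typical follow-up, assuming (this is an assumption) that xxx.pth holds a plain state_dict for the model:

import torch
import torch.nn as nn

device = torch.device('cpu')
model = nn.Linear(4, 2)                                   # stand-in for the real model
checkpoint = torch.load('xxx.pth', map_location=device)   # assumed to be a plain state_dict
model.load_state_dict(checkpoint)
model.to(device)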

Variable values fail to appear, or appear very slowly, when debugging in PyCharm

Check Preferences --> Python Debugger --> Gevent compatible, then re-run the debugger.