
A PyTorch GPU Memory Leak Example

I ran into this GPU memory leak issue while building a PyTorch training pipeline. After spending quite some time debugging, I finally narrowed it down to this minimal reproducible example.

import torch

class AverageMeter(object):
    """
    Keeps track of most recent, average, sum, and count of a metric.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

device: torch.device = torch.device("cuda:0")
model = torch.hub.load('pytorch/vision:v0.9.0', 'resnet18', pretrained=False)
model.to(device)
model.train()

data = torch.zeros(size=[1, 3, 128, 128], device=device, dtype=torch.float)
loss_avg = AverageMeter()
for i in range(1000):
    outputs = model(data)
    loss = outputs.mean()
    loss_avg.update(loss)  # <----- stores the CUDA loss tensor itself
    if i % 10 == 0:
        print('Loss, current batch {:.3f}, moving average {:.3f}'.format(loss_avg.val, loss_avg.avg))
        print('GPU Memory Allocated {} MB'.format(torch.cuda.memory_allocated(device=device) / 1024. / 1024.))

Kicking off the training, the log shows the allocated GPU memory growing steadily:

Using cache found in /home/haoxiangli/.cache/torch/hub/pytorch_vision_v0.9.0
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 51.64306640625 MB
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 119.24072265625 MB
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 186.83837890625 MB
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 254.43603515625 MB
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 322.03369140625 MB
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 389.63134765625 MB
Loss, current batch -0.001, moving average -0.001

This “AverageMeter” is used in many popular repositories (e.g., https://github.com/facebookresearch/moco). By design, it tracks the running average of a given value and can be used to monitor training speed, loss values, and so on.

class AverageMeter(object):
    """
    Keeps track of most recent, average, sum, and count of a metric.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

It came along with some existing training pipeline code, but once it was in my codebase I started using it in other places as well.

The implementation itself is straightforward and bug-free, but it turns out there is something tricky about how it is used.

Below is a modified version without the GPU memory leak:

import torch

class AverageMeter(object):
    """
    Keeps track of most recent, average, sum, and count of a metric.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

device: torch.device = torch.device("cuda:0")
model = torch.hub.load('pytorch/vision:v0.9.0', 'resnet18', pretrained=False)
model.to(device)
model.train()

data = torch.zeros(size=[1, 3, 128, 128], device=device, dtype=torch.float)
loss_avg = AverageMeter()
for i in range(1000):
    outputs = model(data)
    loss = outputs.mean()
    loss_avg.update(loss.item())  # <----- convert to a plain Python float first
    if i % 10 == 0:
        print('Loss, current batch {:.3f}, moving average {:.3f}'.format(loss_avg.val, loss_avg.avg))
        print('GPU Memory Allocated {} MB'.format(torch.cuda.memory_allocated(device=device) / 1024. / 1024.))

The annotated line is the little nuance. The loss tensor is part of the computation graph: it carries a grad_fn that references all the intermediate buffers needed for backpropagation. When the “AverageMeter” stores it (self.val = val, self.sum += val * n), those references keep each iteration's graph alive, so PyTorch cannot release the corresponding GPU memory. The fix is to convert it into a plain Python value beforehand with .item().
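
To make this concrete, here is a minimal sketch of my own (not part of the original pipeline; it only assumes a CUDA device is available). Holding on to a tensor that still has a grad_fn keeps that iteration's saved buffers alive, while a plain Python float or a detached tensor does not:

import torch

device = torch.device("cuda:0")
x = torch.randn(1024, 1024, device=device, requires_grad=True)

history = []
for i in range(50):
    y = (x.sigmoid() * x).sum()   # autograd saves the sigmoid output for the backward pass
    history.append(y)             # keeping y keeps each iteration's saved buffers alive
    # history.append(y.item())    # a plain Python float holds no graph reference
    # history.append(y.detach())  # a detached scalar also lets the graph be freed
    if i % 10 == 0:
        print('GPU Memory Allocated {:.1f} MB'.format(
            torch.cuda.memory_allocated(device=device) / 1024. / 1024.))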

This pattern can show up in a codebase in other forms. Once utility classes like this are in place, it is quite easy to accidentally use them on tensors that are still attached to the computation graph. I do still feel this is a bug in PyTorch (at least <=1.8.0), though.
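
One way to guard against this class of mistakes is to make the meter itself defensive. The following is a sketch of my own (not from any of the repositories mentioned above) that converts scalar tensors to plain numbers before accumulating:

import torch

class AverageMeter(object):
    """
    Keeps track of most recent, average, sum, and count of a metric.
    Converts scalar tensors to plain Python numbers so no autograd graph is retained.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        if isinstance(val, torch.Tensor):
            val = val.item()  # only valid for scalar metrics; drops the graph reference
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count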

Header-only JPEG Save/Load

When deploying a project, I occasionally run into this kind of requirement: libraries like OpenCV cannot be used to read and write images, for example when the target is an embedded environment. After searching around online, I found two libraries that make JPEG reading and writing very convenient.

One is NanoJPEG (someone has made a C++ version of it), which handles reading JPEG files. The other is TinyJPEG, which handles writing the output to a file.

Both are header-only: just include the header files and they are ready to use. They have no dependencies beyond the standard library.

I put together an example here: Example-HeaderOnly-JPEG

  // Decode argv[1] into an interleaved RGB buffer (load_jpg_data is a small
  // helper in the example repo that wraps NanoJPEG).
  vector<uint8_t> image;
  int width, height;
  if (load_jpg_data(argv[1], image, width, height)) {
    cout << "size " << width << " x " << height << endl;

    // Swap the x and y axes (a transposed copy) as a simple round-trip test.
    vector<uint8_t> rotated_image(height*width*3);
    for (int y = 0; y < height; y++) {
      for (int x = 0; x < width; x++) {
        rotated_image[x*height*3+y*3+0] = image[y*width*3+x*3+0];
        rotated_image[x*height*3+y*3+1] = image[y*width*3+x*3+1];
        rotated_image[x*height*3+y*3+2] = image[y*width*3+x*3+2];
      }
    }
    // Encode with TinyJPEG; width and height are swapped to match the new layout.
    tje_encode_to_file(argv[2], height, width, 3, rotated_image.data());
  } else {
    cout << "Failed to open file " << argv[1] << endl;
  }

Set Up NextCloud with Docker

I set up NextCloud once before, and installing MySQL and nginx by hand was both tedious and hard to maintain. After I got more comfortable with Docker, I figured this kind of deployment really ought to be handled by Docker, and a quick search turned up a ready-made solution, so I am writing it down here.

Briefly, NextCloud is a self-hosted cloud platform; put simply, you can set up your own Baidu-Netdisk-style drive on a machine at home 🙂
and then every phone, tablet, and laptop in the house can access it. Our household's needs are mainly sharing and backing up photos: sharing large batches of old photos, and backing up the ones recently taken on our phones.

First, clone this repo:

git clone https://github.com/christophetd/nextcloud-docker-compose

Edit docker-compose.yml as needed and comment out the parts related to automatic backups; everything under backups can simply be removed.

Edit .env:

 vim .env 

Change DATA_DIR to point to the storage location for your home cloud, and HOST to the machine's IP (e.g., 192.168.0.123).
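
For example, the relevant lines might look like this (the storage path is just a placeholder; point it at your own disk):

DATA_DIR=/path/to/your/nextcloud-data
HOST=192.168.0.123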

Then:

docker-compose up -d

That should be it.
If you do not have docker-compose yet, install it first:

 
sudo curl -L "https://github.com/docker/compose/releases/download/1.24.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose 
sudo chmod +x /usr/local/bin/docker-compose
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose 

Then open a browser, go to http://192.168.0.123, set a username and password, and it is ready to use.

It looks pretty much like Baidu Netdisk.

Try uploading a file; the system will then create the user's folder. The actual data lives under ${DATA_DIR}/data/${USERNAME}/files/.

At this point it is basically usable, but the problem everyone ends up facing is how to migrate existing data into the cloud. The NextCloud iOS/Android apps have good support for uploading from a phone, but on a PC you can hardly upload photos one by one through the web page. Fortunately, there is tooling for this.

We can copy the old data locally into the directory where NextCloud stores its data, then use the command-line tool that NextCloud provides to scan it and update MySQL automatically.

 sudo rsync -zvrP /path/to/old/data ${DATA_DIR}/data/${USERNAME}/files/ 

After that, we need to get into the running Docker container to do a few things:

 sudo docker ps 

This lists the running containers. Find the container ID of the nextcloud container, then:

 sudo docker exec -ti ${CONTAINER_ID} bash 

Now you are inside the container's environment.

Next, install sudo first:

 apt-get update && apt-get install -y sudo 

Fix the file permissions:

 sudo chown -R www-data:www-data data/${USERNAME}/files/ 

Rescan the files:

 sudo -u www-data php occ files:scan --all 

If there are a lot of files this will be slow, but just wait it out.

Overall it is quite convenient to use, and there is nothing special about maintaining or deploying it. Docker really does make this easy; I just hope the Docker company can find a way to make money and stay alive.