A PyTorch GPU Memory Leak Example

I ran into this GPU memory leak issue while building a PyTorch training pipeline. After spending quite some time debugging, I finally narrowed it down to the minimal reproducible example below.

import torch

class AverageMeter(object):
    """
    Keeps track of most recent, average, sum, and count of a metric.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

device: torch.device = torch.device("cuda:0")
model = torch.hub.load('pytorch/vision:v0.9.0', 'resnet18', pretrained=False)
model.to(device)
model.train()

data = torch.zeros(size=[1, 3, 128, 128], device=device, dtype=torch.float)
loss_avg = AverageMeter()
for i in range(1000):
    outputs = model(data)
    loss = outputs.mean()
    loss_avg.update(loss)  # stores the loss tensor itself, still attached to the computation graph
    if i % 10 == 0:
        print('Loss, current batch {:.3f}, moving average {:.3f}'.format(loss_avg.val, loss_avg.avg))
        print('GPU Memory Allocated {} MB'.format(torch.cuda.memory_allocated(device=device) / 1024. / 1024.))

Kicking off the training, I see the allocated GPU memory increasing constantly.

Using cache found in /home/haoxiangli/.cache/torch/hub/pytorch_vision_v0.9.0
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 51.64306640625 MB
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 119.24072265625 MB
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 186.83837890625 MB
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 254.43603515625 MB
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 322.03369140625 MB
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 389.63134765625 MB
Loss, current batch -0.001, moving average -0.001

This “AverageMeter” has been used in many popular repositories (e.g., https://github.com/facebookresearch/moco). By design, it tracks the running average of a given value and can be used to monitor training speed, loss values, and so on.

class AverageMeter(object):
    """
    Keeps track of most recent, average, sum, and count of a metric.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

It came along with some existing training pipeline code, but once I had it in my codebase I started using it elsewhere as well.

The implementation itself is straightforward and bug-free, but it turns out there is something tricky about how it is used.

The following is a modified version without the GPU memory leak:

import torch

class AverageMeter(object):
    """
    Keeps track of most recent, average, sum, and count of a metric.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

device: torch.device = torch.device("cuda:0")
model = torch.hub.load('pytorch/vision:v0.9.0', 'resnet18', pretrained=False)
model.to(device)
model.train()

data = torch.zeros(size=[1, 3, 128, 128], device=device, dtype=torch.float)
loss_avg = AverageMeter()
for i in range(1000):
    outputs = model(data)
    loss = outputs.mean()
    loss_avg.update(loss.item()) # <----- convert to a plain Python float first
    if i % 10 == 0:
        print('Loss, current batch {:.3f}, moving average {:.3f}'.format(loss_avg.val, loss_avg.avg))
        print('GPU Memory Allocated {} MB'.format(torch.cuda.memory_allocated(device=device) / 1024. / 1024.))

The annotated line is the key nuance. When a value that is still part of the computation graph is tracked with the “AverageMeter”, the meter keeps a reference to it, and through the line “self.sum += val * n” it chains every iteration's graph together, so PyTorch cannot release the corresponding GPU memory. The fix is to convert the tensor into a plain Python number beforehand.
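To make the cause concrete: as long as “loss_avg.sum” holds a tensor with autograd history, none of the per-iteration graphs behind it can be freed. Any of the following variants break that reference and stop the leak; this is a minimal sketch using standard PyTorch calls, assuming the same scalar “loss” tensor as above:

# Equivalent ways to record a scalar loss without retaining the graph.
loss_avg.update(loss.item())    # convert the 0-dim tensor to a plain Python float
loss_avg.update(float(loss))    # same effect as .item() for 0-dim tensors
loss_avg.update(loss.detach())  # still a tensor, but detached from the computation graph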

This pattern may show up in a codebase in other forms. Once such utility classes are in place, it is quite easy to accidentally use them on trainable PyTorch tensors. I do feel this is a bug in PyTorch (at least <=1.8.0), though.
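One way to guard against this whole class of mistakes is to make the utility itself defensive. Below is a minimal sketch of such a variant (the name “SafeAverageMeter” and the tensor check are my own addition, not part of the original repositories); it converts any incoming scalar PyTorch tensor into a plain Python number inside “update”, so a caller can no longer hand it a graph-attached tensor by accident:

import torch

class SafeAverageMeter(object):
    """
    Same as AverageMeter, but converts incoming scalar tensors to plain
    Python numbers so the meter never retains the autograd graph.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        # Assumes val is either a number or a 0-dim tensor; .item() would
        # raise an error for tensors with more than one element.
        if isinstance(val, torch.Tensor):
            val = val.item()
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

With this version, “loss_avg = SafeAverageMeter()” followed by “loss_avg.update(loss)” no longer leaks, though calling “.item()” on every update forces a GPU-to-CPU synchronization, so it is not entirely free.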

Open (《网》): Agassi's Autobiography

https://book.douban.com/subject/30164685/

I don't know tennis, and before reading this book I knew nothing about the man. I picked it up simply because the ratings were good. Having read it, it feels like a very sincere autobiography, with no showing off and no feel-good platitudes. Starting from his childhood, when his father forced him to train, you can feel his inner conflict through the writing. Agassi says he hates tennis, he hates training, and he was rebellious, both at school and at the tennis academy. But he also badly wanted to win. Throughout the book, and especially in the first half, there is a tangled, conflicted mood running through Agassi's life. In the second half, after Graf enters his life, every line is full of his admiration and affection for her; Graf is what finally brings him inner peace.

Agassi must be someone with an extraordinarily rich inner world. For a top tennis player, the way he describes his matches makes it feel as if, on court, he is not competing with his opponent on technique but on who can better control his own mind. There were several times when his mental state fell apart and, despite being a seeded player, he lost in the first round. Whenever he narrates a match, it is never about whose shots are better, but about who is more focused, who is more self-aware, and who wants to win more.

The book is written in the first person, but not from a retrospective point of view; it is told in the voice of Agassi as he was at the time. The narration faithfully reflects his thoughts and emotions in each period and is very vivid. It really is an excellent autobiography.

Factfulness

I recently read Factfulness and found it quite interesting. The author introduces ten thinking tools to counter common instincts that mislead us. Because he has such a rich life experience, he illustrates both the positive and the negative cases with vivid stories of his own, so the book never feels dry and is often genuinely enlightening.

Dividing everything into two is one such common pitfall. Black-or-white judgments are usually one-sided; things are never only "low" or "high", there is always a transitional state in between, and in fact the middle is often where the majority sits. Here the author uses developing versus developed countries as an example: although people bring up this split all the time, if you look at countries' indicators across various dimensions, most countries actually fall somewhere in between. He suggests trying to replace the two buckets with four, and not only for this particular concept; doing so often draws our attention to where the majority actually is.

One pitfall the book emphasizes a lot is negative thinking. In the author's view, most people's estimate of the state of the world is more negative than reality. An important cause is selective coverage by the media, or rather that the media is inherently inclined to report accidents, tragedies, conflicts, and other attention-grabbing events. One way to break out of these preconceptions is to look at the direction of change objectively through data, while acknowledging that things can still be bad now and yet have been steadily getting better. The book offers many examples and figures supporting the claim that the world is improving, and quite a few of the numbers really differ from what I had assumed. One indicator used to show that life is generally getting better is amusing: the author points to the rising number of guitars per capita over the years as evidence that the world is improving. In 2014, roughly 1 in 100 people owned a guitar. That ratio is actually pretty high, about the same as the current COVID-19 test positivity rate in the US XD

What also left a deep impression on me are the stories the author tells about himself, all used as cautionary examples of how these thinking traps can do harm. In one, during the Ebola outbreak in Africa, a lockdown decision he proposed in haste, without thinking it through, indirectly led to several deaths. In another, following what was then considered correct practice, he used to turn babies from their backs onto their stomachs, which increased their risk of suffocation.

Reading these two stories, I was surprised by the author's candor and admired the courage it took to share them as examples. Strictly speaking, he did nothing wrong, but presenting them here directly links his decisions to the outcomes, and I imagine he must have felt considerable guilt over them at some point.

All in all, a very interesting and very readable book.