Tag: bug

A PyTorch GPU Memory Leak Example

I ran into this GPU memory leak issue when building a PyTorch training pipeline. After spending quite some time, I finally figured out this minimal reproducible example.

import torch

class AverageMeter(object):
    """
    Keeps track of most recent, average, sum, and count of a metric.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

device: torch.device = torch.device("cuda:0")
model = torch.hub.load('pytorch/vision:v0.9.0', 'resnet18', pretrained=False)
model.to(device)
model.train()

data = torch.zeros(size=[1,3,128,128],device=device,dtype=torch.float)
loss_avg = AverageMeter()
for i in range(1000):
    outputs = model(data)
    loss = outputs.mean()
    loss_avg.update(loss)
    if i % 10 == 0:
        print ('Loss, current batch {:.3f}, moving average {:.3f}'.format(loss_avg.val, loss_avg.avg))
        print ('GPU Memory Allocated {} MB'.format(torch.cuda.memory_allocated(device=device)/1024./1024.))

Kicking off the training, it shows constantly increasing allocated GPU memory.

Using cache found in /home/haoxiangli/.cache/torch/hub/pytorch_vision_v0.9.0
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 51.64306640625 MB
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 119.24072265625 MB
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 186.83837890625 MB
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 254.43603515625 MB
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 322.03369140625 MB
Loss, current batch -0.001, moving average -0.001
GPU Memory Allocated 389.63134765625 MB
Loss, current batch -0.001, moving average -0.001

This “AverageMeter” has been used in many popular repositories (e.g., https://github.com/facebookresearch/moco). It’s by-design tracking the average of a given value and can be used to track training speed, loss value, and etc.

class AverageMeter(object):
    """
    Keeps track of most recent, average, sum, and count of a metric.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

It comes with some existing training pipeline code but once I got this into my codebase, I started to use it elsewhere.

The implementation is straightforward and bug-free but it turns out there is something tricky here.

Following is a modified version without the GPU memory leak problem:

import torch

class AverageMeter(object):
    """
    Keeps track of most recent, average, sum, and count of a metric.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

device: torch.device = torch.device("cuda:0")
model = torch.hub.load('pytorch/vision:v0.9.0', 'resnet18', pretrained=False)
model.to(device)
model.train()

data = torch.zeros(size=[1,3,128,128],device=device,dtype=torch.float)
loss_avg = AverageMeter()
for i in range(1000):
    outputs = model(data)
    loss = outputs.mean()
    loss_avg.update(loss.item()) # <-----
    if i % 10 == 0:
        print ('Loss, current batch {:.3f}, moving average {:.3f}'.format(loss_avg.val, loss_avg.avg))
        print ('GPU Memory Allocated {} MB'.format(torch.cuda.memory_allocated(device=device)/1024./1024.))

The annotated line is the little nuance. When something part of the computation graph is tracked with the “AverageMeter”, somehow PyTorch stops releasing related part of GPU memory. The fix is to cast it into a plain value beforehand.

This paradigm may show in the codebase in other forms. Once there is some utility classes being implemented, it is quite easy to accidentally use it over trainable PyTorch tensors. I do feel this is a bug of PyTorch (at least <=1.8.0) though.

[Bug] g++4.6 参数顺序

遇到一个bug, 看起来像是g++-4.6的问题。

问题是这样的。这个源文件用到了OpenCV:

//< file: test.cpp
#include 

int main (int argc, char** argv) {
    cv::Mat image;
    return 0;
}

用这样一行命令编译:

g++-4.6 `pkg-config --libs opencv`  -o test.bin test.cpp

遇到了错误:

/tmp/ccs2MlQz.o: In function `cv::Mat::~Mat()':
test.cpp:(.text._ZN2cv3MatD2Ev[_ZN2cv3MatD5Ev]+0x39): undefined reference to `cv::fastFree(void*)'
/tmp/ccs2MlQz.o: In function `cv::Mat::release()':
test.cpp:(.text._ZN2cv3Mat7releaseEv[cv::Mat::release()]+0x47): undefined reference to `cv::Mat::deallocate()'
collect2: ld returned 1 exit status

错误的原因应该是g++没有正确的链接到OpenCV的库。各种尝试之后发现只要调换一下参数的位置就可以正常编译 -_-!!
改用这样一行命令编译就没有问题了。

g++-4.6 test.cpp `pkg-config --libs opencv`  -o test.bin

具体原因不明,但是如果把g++-4.6换作g++-4.4就没有这个问题。
这行命令也可以正常编译:

g++-4.4 `pkg-config --libs opencv`  -o test.bin test.cpp 

这么看起来很有可能是g++-4.6的bug,或者是改进..?

libstdc++ 4.6 type_traits 的一个bug

如果你的编译环境和我一样,然后又在用C++11的时候,不小心直接或者间接用到了<chrono>这个头文件,应该就会遇到这个bug。

完整的错误信息如下:

|| In file included from /usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../include/c++/4.6/thread:37:
/usr/include/c++/4.6/chrono|240 col 10| error: cannot cast from lvalue of type 'const long' to rvalue reference type 'rep' (aka 'long &&'); types are not compatible
||           : __r(static_cast(__rep)) { }
||                 ^~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/4.6/chrono|128 col 13| note: in instantiation of function template specialization 'std::chrono::duration >::duration' requested here
||             return _ToDur(static_cast(__d.count()));
||                    ^
/usr/include/c++/4.6/chrono|182 col 9| note: in instantiation of function template specialization 'std::chrono::__duration_cast_impl<std::chrono::duration >, std::ratio, long &&, true, true>::__cast >' requested here
||         return __dc::__cast(__d);
||                ^
/usr/include/c++/4.6/chrono|247 col 10| note: in instantiation of function template specialization 'std::chrono::duration_cast<std::chrono::duration >, long, std::ratio >' requested here
||           : __r(duration_cast(__d).count()) { }
||                 ^
/usr/include/c++/4.6/chrono|466 col 9| note: in instantiation of function template specialization 'std::chrono::duration >::duration, void>' requested here
||         return __ct(__lhs).count() < __ct(__rhs).count();
||                ^
/usr/include/c++/4.6/chrono|667 col 7| note: in instantiation of function template specialization 'std::chrono::operator<, long, std::ratio >' requested here
||                     < system_clock::duration::zero(),
||                     ^
/usr/include/c++/4.6/chrono|128 col 13| error: call to implicitly-deleted copy constructor of 'std::chrono::duration >'
||             return _ToDur(static_cast(__d.count()));
||                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/4.6/chrono|182 col 9| note: in instantiation of function template specialization 'std::chrono::__duration_cast_impl<std::chrono::duration >, std::ratio, long &&, true, true>::__cast >' requested here
||         return __dc::__cast(__d);
||                ^
/usr/include/c++/4.6/chrono|247 col 10| note: in instantiation of function template specialization 'std::chrono::duration_cast<std::chrono::duration >, long, std::ratio >' requested here
||           : __r(duration_cast(__d).count()) { }
||                 ^
/usr/include/c++/4.6/chrono|466 col 9| note: in instantiation of function template specialization 'std::chrono::duration >::duration, void>' requested here
||         return __ct(__lhs).count() < __ct(__rhs).count();
||                ^
/usr/include/c++/4.6/chrono|667 col 7| note: in instantiation of function template specialization 'std::chrono::operator<, long, std::ratio >' requested here
||                     < system_clock::duration::zero(),
||                     ^
/usr/include/c++/4.6/chrono|233 col 12| note: explicitly defaulted function was implicitly deleted here
||         constexpr duration(const duration&) = default;
||                   ^
/usr/include/c++/4.6/chrono|349 col 6| note: copy constructor of 'duration >' is implicitly deleted because field '__r' is of rvalue reference type 'rep' (aka 'long &&')
||         rep __r;
||             ^
/usr/include/c++/4.6/chrono|182 col 9| error: call to implicitly-deleted copy constructor of 'typename enable_if<__is_duration<duration > >::value, duration > >::type' (aka 'std::chrono::duration >')
||         return __dc::__cast(__d);
||                ^~~~~~~~~~~~~~~~~
/usr/include/c++/4.6/chrono|247 col 10| note: in instantiation of function template specialization 'std::chrono::duration_cast<std::chrono::duration >, long, std::ratio >' requested here
||           : __r(duration_cast(__d).count()) { }
||                 ^
/usr/include/c++/4.6/chrono|466 col 9| note: in instantiation of function template specialization 'std::chrono::duration >::duration, void>' requested here
||         return __ct(__lhs).count() < __ct(__rhs).count();
||                ^
/usr/include/c++/4.6/chrono|667 col 7| note: in instantiation of function template specialization 'std::chrono::operator<, long, std::ratio >' requested here
||                     < system_clock::duration::zero(),
||                     ^
/usr/include/c++/4.6/chrono|233 col 12| note: explicitly defaulted function was implicitly deleted here
||         constexpr duration(const duration&) = default;
||                   ^
/usr/include/c++/4.6/chrono|349 col 6| note: copy constructor of 'duration >' is implicitly deleted because field '__r' is of rvalue reference type 'rep' (aka 'long &&')
||         rep __r;
||             ^
/usr/include/c++/4.6/chrono|255 col 11| error: rvalue reference to type 'long' cannot bind to lvalue of type 'long'
||         { return __r; }
||                  ^~~
/usr/include/c++/4.6/chrono|247 col 39| note: in instantiation of member function 'std::chrono::duration >::count' requested here
||           : __r(duration_cast(__d).count()) { }
||                                              ^
/usr/include/c++/4.6/chrono|466 col 9| note: in instantiation of function template specialization 'std::chrono::duration >::duration, void>' requested here
||         return __ct(__lhs).count() < __ct(__rhs).count();
||                ^
/usr/include/c++/4.6/chrono|667 col 7| note: in instantiation of function template specialization 'std::chrono::operator<, long, std::ratio >' requested here
||                     < system_clock::duration::zero(),
||                     ^
/usr/include/c++/4.6/chrono|666 col 21| error: static_assert expression is not an integral constant expression
||       static_assert(system_clock::duration::min()
||                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/4.6/chrono|666 col 21| note: undefined function 'operator<, long, std::ratio >' cannot be used in a constant expression
/usr/include/c++/4.6/chrono|460 col 7| note: declared here
||       operator& __lhs,
||       ^
/usr/include/c++/4.6/chrono|141 col 40| error: cannot cast from lvalue of type 'const intmax_t' (aka 'const long') to rvalue reference type 'long &&'; types are not compatible
||               static_cast(__d.count()) / static_cast(_CF::den)));
||                                               ^~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/4.6/chrono|182 col 9| note: in instantiation of function template specialization 'std::chrono::__duration_cast_impl<std::chrono::duration >, std::ratio, long &&, true, false>::__cast >' requested here
||         return __dc::__cast(__d);
||                ^
/usr/include/c++/4.6/chrono|679 col 21| note: in instantiation of function template specialization 'std::chrono::duration_cast<std::chrono::duration >, long, std::ratio >' requested here
||         return std::time_t(duration_cast
||                            ^
/usr/include/c++/4.6/chrono|154 col 40| error: cannot cast from lvalue of type 'const intmax_t' (aka 'const long') to rvalue reference type 'long &&'; types are not compatible
||               static_cast(__d.count()) * static_cast(_CF::num)));
||                                               ^~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/4.6/chrono|182 col 9| note: in instantiation of function template specialization 'std::chrono::__duration_cast_impl<std::chrono::duration >, std::ratio, long &&, false, true>::__cast >' requested here
||         return __dc::__cast(__d);
||                ^
/usr/include/c++/4.6/chrono|577 col 22| note: in instantiation of function template specialization 'std::chrono::duration_cast<std::chrono::duration >, long, std::ratio >' requested here
||         return __time_point(duration_cast(__t.time_since_epoch()));
||                             ^
/usr/include/c++/4.6/chrono|687 col 9| note: in instantiation of function template specialization 'std::chrono::time_point_cast<std::chrono::duration >, std::chrono::system_clock, std::chrono::duration > >' requested here
||         return time_point_cast
||                ^
|| 7 errors generated.

Google搜到了这个帖子 http://urfoex.blogspot.com/2012/06/c11-fixing-bug-to-use-chrono-with-clang.html
和我的环境略有不同,但是出错的地方是一样的。

我自己的环境是

Ubuntu 12.04, clang 3.1, libstdc++ 4.6

解决方法是改

/usr/include/c++/4.6/type_traits

这个文件的第1113行

{ typedef decltype(true ? declval<_Tp>() : declval<_Up>()) type; };

改成

{ typedef typename decay() : declval<_Up>())>::type type; }

出错的原因,大概是因为没有正确的做类型转换。decay这个函数我从没用过,看文档是C++11引入的函数,用做类型转换。对此不理解,不能在这里误导人,文档原文如下

Applies lvalue-to-rvalue, array-to-pointer, and function-to-pointer implicit conversions to the type T, removes cv-qualifiers, and defines the resulting type as the member typedef type. This is the type conversion applied to all function arguments when passed by value.

总之,把那行改掉再重新编译自己的工程就没有问题了。

如果你是和我一样没有root权限去改/usr/include下的文件,也可以这么做:
1. 把/usr/include/c++/4.6/type_traits拷贝到自己的目录下面,比如

$cp /usr/include/c++/4.6/type_traits $HOME/local/include/c++/4.6

2. 改自己目录下的版本,然后编译的时候把这个目录放到编译选项里,clang会优先使用这个目录下的版本

$clang++ blahblah.cpp -I$HOME/local/include/c++/4.6/

好吧,虽然没有完全搞清楚这个bug是怎么回事,但是解决掉一大堆报错的感觉挺好 :]