This commit fixes the bug in #2232 where cadvisor was unable to detect
the cloud provider when running on a custom AMI derived from
Amazon Linux 2.
It does so by checking /etc/os-release. However, from what I've read,
/etc/os-release is primarily a systemd convention. Although Amazon Linux 2
ships with systemd, cadvisor cannot assume systemd is present on
other AMIs / OSes, so /etc/os-release is only checked when all other
detection methods fail.
context: kubernetes/kubernetes#68478
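For illustration, a fallback of this shape could look roughly like the sketch below; the function name, the matched field, and the field value are assumptions, not the actual cadvisor code.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// onAmazonLinux2 is an illustrative fallback check: it scans
// /etc/os-release for an Amazon Linux 2 identifier. It is only meant to
// run after the other cloud-provider detection methods have failed.
func onAmazonLinux2() bool {
	f, err := os.Open("/etc/os-release")
	if err != nil {
		// /etc/os-release is a systemd convention; it may simply not exist.
		return false
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		// Typical contents on Amazon Linux 2 include lines such as
		//   PRETTY_NAME="Amazon Linux 2"
		if strings.HasPrefix(line, "PRETTY_NAME=") &&
			strings.Contains(line, "Amazon Linux 2") {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println("amazon linux 2:", onAmazonLinux2())
}
```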
The inotify code was removed from golang.org/x/exp several years ago, so
importing it from that path prevents downstream consumers from using any
module that relies on more recent features of golang.org/x/exp.
Given that this code is by definition frozen, and that the long-term path
should be to migrate to fsnotify, replacing the current import with an
identical standalone copy adds no maintenance cost and unblocks other work
for downstream consumers such as Kubernetes.
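For reference, a watcher built on the public github.com/fsnotify/fsnotify API (the suggested long-term target) could look roughly like the sketch below; the watched path is illustrative and this is not code from this change.

```go
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	// fsnotify exposes a similar watcher/event-channel model to the frozen
	// golang.org/x/exp/inotify code, which is why it is the natural target.
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Path chosen purely for illustration.
	if err := watcher.Add("/var/log/containers"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return
			}
			log.Println("event:", event)
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			log.Println("watch error:", err)
		}
	}
}
```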
The oomparser logic would end up stuck, unable to detect the end of a
given oom trace, for any process whose name did not match \w+.
This includes processes like 'python3.4' (because of the '.') or
'docker-containerd' (because of the '-').
This fix was included in PR #1544 last year, but since that PR appears
dead, it's worth breaking this more important fix out on its own.
I've updated the tests so that they would have caught this issue.
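A minimal illustration of the matching problem follows; the patterns and the sample line are simplified, not the exact oomparser regexps.

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// An oom-kill line of roughly the shape the parser scans for.
	line := `Killed process 1234 (docker-containerd) total-vm:100kB`

	// Requiring the process name to be \w+ cannot cross a '-' or '.', so
	// the terminating "Killed process ..." line is never recognized for
	// names like 'python3.4' or 'docker-containerd'.
	tooStrict := regexp.MustCompile(`Killed process ([0-9]+) \((\w+)\) total-vm`)
	fmt.Println("strict matches:", tooStrict.MatchString(line)) // false

	// A more permissive capture accepts the '.' and '-' seen in real names.
	relaxed := regexp.MustCompile(`Killed process ([0-9]+) \((.+)\) total-vm`)
	fmt.Println("relaxed matches:", relaxed.MatchString(line)) // true
}
```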
IN_ATTRIB inotify events are generated when atime / mtime changes, which
would cause the tail to be reset and the same log to be reread
(generating duplicate events). Instead, watch the directory for
file delete / move events.
Also, use an exponential backoff when retrying to open the file.
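A rough sketch of the retry side of this change, assuming a simple doubling backoff; the constants, helper name, and log path are made up for illustration and are not the actual tail implementation.

```go
package main

import (
	"log"
	"os"
	"time"
)

// openWithBackoff retries opening a file, doubling the wait after each
// failed attempt up to a cap, instead of retrying at a fixed interval.
func openWithBackoff(path string) (*os.File, error) {
	const (
		initialDelay = 100 * time.Millisecond
		maxDelay     = 30 * time.Second
		maxAttempts  = 10
	)
	delay := initialDelay
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		f, err := os.Open(path)
		if err == nil {
			return f, nil
		}
		lastErr = err
		time.Sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return nil, lastErr
}

func main() {
	f, err := openWithBackoff("/var/log/containers/app.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	log.Println("tailing", f.Name())
}
```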
This ensures each goroutine is given its own Netlink connection, and
presumably avoids a message destined for one goroutine being read by
another goroutine.
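A sketch of the ownership change, using a hypothetical newConnection constructor and stubbed methods in place of the real Netlink code.

```go
package main

import "log"

// conn stands in for cadvisor's Netlink connection type; newConnection and
// getStats are hypothetical stubs used purely to illustrate the change.
type conn struct{ /* netlink socket fd, sequence counter, ... */ }

func newConnection() (*conn, error)    { return &conn{}, nil }
func (c *conn) Close() error           { return nil }
func (c *conn) getStats() (int, error) { return 0, nil }

func worker(id int, done chan<- struct{}) {
	defer func() { done <- struct{}{} }()

	// Each goroutine creates (and later closes) its own connection, so a
	// reply destined for this goroutine cannot be consumed by another
	// goroutine reading from a shared socket.
	c, err := newConnection()
	if err != nil {
		log.Printf("worker %d: %v", id, err)
		return
	}
	defer c.Close()

	if _, err := c.getStats(); err != nil {
		log.Printf("worker %d: %v", id, err)
	}
}

func main() {
	done := make(chan struct{})
	for i := 0; i < 4; i++ {
		go worker(i, done)
	}
	for i := 0; i < 4; i++ {
		<-done
	}
}
```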
Not closing the FDs manually means we have to rely on garbage collection
running before cgroup FDs are closed. If the system is running a lot of
load probes at a high frequency (i.e. dynamic housekeeping isn't backing
off because of load variations), we can end up hitting our FD limit
because lots of (useless) FDs are kept around.
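A sketch of the intended pattern, with a made-up helper and cgroup path: the FD is released explicitly as soon as the probe finishes, instead of waiting for a finalizer to run at some later garbage collection.

```go
package main

import (
	"log"
	"os"
)

// readCgroupStat is illustrative only: the helper name and the cgroup path
// are assumptions. The point is the explicit, deferred Close; without it
// the *os.File (and its FD) survives until a finalizer eventually runs,
// and frequent probes can exhaust the process FD limit in the meantime.
func readCgroupStat(cgroupPath string) ([]byte, error) {
	f, err := os.Open(cgroupPath)
	if err != nil {
		return nil, err
	}
	defer f.Close() // release the FD as soon as the probe is done

	buf := make([]byte, 4096)
	n, err := f.Read(buf)
	if err != nil {
		return nil, err
	}
	return buf[:n], nil
}

func main() {
	data, err := readCgroupStat("/sys/fs/cgroup/memory/memory.stat")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("read %d bytes", len(data))
}
```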