Temporary Cluster Outage

Description

It is 10:56. Ganglia reports that only spg00 is up. The front panel of the SPRAID is blinking. Nevertheless, condor_status shows:
[mdias@sprace mdias]$ ssh spgrid '. /OSG/setup.sh ;condor_status'

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

vm1@node01.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node01.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:43
vm1@node02.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node02.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:41
vm1@node03.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node03.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:39
vm1@node04.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+00:24:57
vm2@node04.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:41
vm1@node05.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+21:49:34
vm2@node05.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:05
vm1@node06.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node06.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:39
vm1@node07.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node07.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:40
vm1@node08.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node08.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:37
vm1@node09.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:39:57
vm2@node09.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:05
vm1@node10.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node10.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:40
vm1@node11.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:05
vm2@node11.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:36
vm1@node12.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:05
vm2@node12.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:39
vm1@node13.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+05:44:37
vm2@node13.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:05
vm1@node14.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node14.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:40
vm1@node15.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:25:04
vm2@node15.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:25:35
vm1@node16.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node16.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:35
vm1@node17.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node17.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:39
vm1@node18.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node18.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:37
vm1@node21.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:30:05
vm2@node21.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:30:41
vm1@node22.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:30:04
vm2@node22.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:30:38
vm1@node23.gr LINUX       INTEL  Unclaimed  Idle       0.000  1003  0+01:30:04
vm2@node23.gr LINUX       INTEL  Unclaimed  Idle       0.000  1003  1+01:30:35
vm1@spgrid.if LINUX       INTEL  Unclaimed  Idle       1.000  1003  0+02:10:04
vm2@spgrid.if LINUX       INTEL  Unclaimed  Idle       10.460  1003  1+02:10:53

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX    44     0       0        44       0          0        0

               Total    44     0       0        44       0          0        0
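Only node01-node18 and node21-node23 appear in the pool above. Below is a minimal sketch to list which workers stopped reporting to Condor; it assumes hostnames follow the nodeNN pattern, and the 01-38 range is only a guess (node38 answers ping further down) that should be replaced by the real node list.

# Hedged sketch: report worker nodes missing from the Condor pool.
ssh spgrid '. /OSG/setup.sh; condor_status -format "%s\n" Machine' | sort -u > /tmp/condor_hosts
for n in $(seq -w 1 38); do
    grep -q "node$n" /tmp/condor_hosts || echo "node$n is not reporting to Condor"
done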

The nodes respond to ping:

[root@sprace:root]# ping node38
PING node38.cluster (192.168.1.38) from 192.168.1.200 : 56(84) bytes of data.
64 bytes from node38.cluster (192.168.1.38): icmp_seq=1 ttl=64 time=0.193 ms
64 bytes from node38.cluster (192.168.1.38): icmp_seq=2 ttl=64 time=0.190 ms

--- node38.cluster ping statistics ---
2 packets transmitted, 2 received, 0% loss, time 999ms
rtt min/avg/max/mdev = 0.190/0.191/0.193/0.013 ms
but logging in via SSH does not work (a quick reachability sweep is sketched after the df and free output below). On spraid:
[mdias@spraid mdias]$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda2              2063536    599940   1358772  31% /
none                   1027720         0   1027720   0% /dev/shm
/dev/sda7              1035660     34728    948324   4% /tmp
/dev/sda5             10317828   2196156   7597556  23% /usr
/dev/sda8             15346304   1444488  13122264  10% /usr/local
/dev/sda6              2063504    413860   1544824  22% /var
/dev/sdb1            1833096736  92955700 1647025088   6% /raid0
/dev/sdc1            1833096736 963934088 776046700  56% /raid1
/dev/sdd1            1730092600 264919452 1377289568  17% /raid2
/dev/sde1            1730092600 225076752 1417132268  14% /raid3
/dev/sdf1            1730092600 208326584 1433882436  13% /raid4
/dev/sdg1            1730092600 220788532 1421420488  14% /raid5
spdc00:/pnfsdoors       400000     80000    284000  22% /pnfs/if.usp.br
[mdias@spraid mdias]$ free
             total       used       free     shared    buffers     cached
Mem:       2055440    2038452      16988          0     581612    1175224
-/+ buffers/cache:     281616    1773824
Swap:      4192956      12724    4180232
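Note that most of the "used" memory on spraid is buffer/page cache (the -/+ buffers/cache line shows about 1.7 GB free), so spraid itself is not memory-starved. To confirm which nodes answer ping but refuse SSH, here is a minimal sweep sketch; the nodeNN range and the 5-second timeout are assumptions.

# Hedged sketch: compare ping and SSH reachability for each worker node.
for n in $(seq -w 1 38); do
    host=node$n
    ping -c 1 -W 1 $host > /dev/null 2>&1 && p=ping-ok || p=ping-fail
    ssh -o ConnectTimeout=5 -o BatchMode=yes $host true > /dev/null 2>&1 && s=ssh-ok || s=ssh-fail
    echo "$host: $p $s"
done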

We also show up as down on http://cms-project-phedex.web.cern.ch/cms-project-phedex/cgi-bin/browser.
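A quick way to tell "the PhEDEx browser page itself is unreachable" apart from "PhEDEx marks the site as inactive" is to fetch the page directly. This is only a sketch that checks whether the CGI answers over HTTP, not the state of the transfer agents; nothing site-specific is assumed beyond the URL above.

# Hedged sketch: check that the PhEDEx browser CGI answers over HTTP.
curl -sI http://cms-project-phedex.web.cern.ch/cms-project-phedex/cgi-bin/browser | head -1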

Updates

11:05. I did not change anything and we are OK again on PhEDEx and on Ganglia, but things are extremely unstable, with some nodes dropping out from time to time (a simple watch loop to record the drops is sketched after the df/free output below). The logs look fine. spg00 is still "pingable" but has hit its utilization peak. I am logged in on spraid:

[mdias@spraid mdias]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             2.0G  586M  1.3G  31% /
none                 1004M     0 1004M   0% /dev/shm
/dev/sda7            1012M   34M  927M   4% /tmp
/dev/sda5             9.9G  2.1G  7.3G  23% /usr
/dev/sda8              15G  1.4G   13G  10% /usr/local
/dev/sda6             2.0G  405M  1.5G  22% /var
/dev/sdb1             1.8T   89G  1.6T   6% /raid0
/dev/sdc1             1.8T  920G  741G  56% /raid1
/dev/sdd1             1.7T  250G  1.3T  16% /raid2
/dev/sde1             1.7T  212G  1.4T  14% /raid3
/dev/sdf1             1.7T  198G  1.4T  13% /raid4
/dev/sdg1             1.7T  210G  1.4T  14% /raid5
spdc00:/pnfsdoors     391M   79M  278M  22% /pnfs/if.usp.br
[mdias@spraid mdias]$ free
             total       used       free     shared    buffers     cached
Mem:       2055440    2038660      16780          0     579184    1174588
-/+ buffers/cache:     284888    1770552
Swap:      4192956      12684    4180272
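Since the nodes keep dropping intermittently, a simple watch loop can timestamp each drop for later correlation with the logs. This is only a sketch: the nodeNN range and the 5-minute interval are assumptions.

# Hedged sketch: record which nodes stop answering ping, every 5 minutes.
while true; do
    for n in $(seq -w 1 38); do
        ping -c 1 -W 1 node$n > /dev/null 2>&1 || echo "$(date '+%F %T') node$n down"
    done >> /tmp/node_drops.log
    sleep 300
done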