r1 - 28 Sep 2006 - 14:18:08 - MarcoAndreFerreiraDias

Temporary Cluster Outage

Description

It is 10:56. Ganglia reports that only spg00 is up. The front panel of SPRAID is blinking. However, Condor reports:
[mdias@sprace mdias]$ ssh spgrid '. /OSG/setup.sh ;condor_status'

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

vm1@node01.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node01.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:43
vm1@node02.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node02.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:41
vm1@node03.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node03.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:39
vm1@node04.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+00:24:57
vm2@node04.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:41
vm1@node05.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+21:49:34
vm2@node05.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:05
vm1@node06.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node06.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:39
vm1@node07.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node07.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:40
vm1@node08.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node08.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:37
vm1@node09.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:39:57
vm2@node09.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:05
vm1@node10.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node10.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:40
vm1@node11.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:05
vm2@node11.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:36
vm1@node12.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:05
vm2@node12.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:39
vm1@node13.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+05:44:37
vm2@node13.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:05
vm1@node14.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node14.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:40
vm1@node15.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:25:04
vm2@node15.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:25:35
vm1@node16.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node16.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:35
vm1@node17.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node17.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:39
vm1@node18.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:35:04
vm2@node18.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:35:37
vm1@node21.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:30:05
vm2@node21.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:30:41
vm1@node22.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  0+01:30:04
vm2@node22.gr LINUX       INTEL  Unclaimed  Idle       0.000   500  1+01:30:38
vm1@node23.gr LINUX       INTEL  Unclaimed  Idle       0.000  1003  0+01:30:04
vm2@node23.gr LINUX       INTEL  Unclaimed  Idle       0.000  1003  1+01:30:35
vm1@spgrid.if LINUX       INTEL  Unclaimed  Idle       1.000  1003  0+02:10:04
vm2@spgrid.if LINUX       INTEL  Unclaimed  Idle       10.460  1003  1+02:10:53

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX    44     0       0        44       0          0        0

               Total    44     0       0        44       0          0        0
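Since Ganglia and Condor disagree, it can help to tally the Condor slots by state instead of reading the listing by eye. The following is a minimal sketch, not part of the original log: `count_states` is a hypothetical helper name, and it assumes the column layout shown above (State in the fourth column, slot names containing `@`).

```shell
#!/bin/bash
# Hypothetical helper: tally condor_status slots by their State column.
# Assumes the layout shown in this log: Name OpSys Arch State Activity ...
# Only lines whose first field contains "@" (vmN@host) are counted, so the
# header and the summary table are skipped.
count_states() {
    awk '$1 ~ /@/ { count[$4]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Example (same remote invocation as in the log entry):
#   ssh spgrid '. /OSG/setup.sh ; condor_status' | count_states
```

For the listing above this would print `Unclaimed 44`, which agrees with the summary table.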

The nodes respond to ping:

[root@sprace:root]# ping node38
PING node38.cluster (192.168.1.38) from 192.168.1.200 : 56(84) bytes of data.
64 bytes from node38.cluster (192.168.1.38): icmp_seq=1 ttl=64 time=0.193 ms
64 bytes from node38.cluster (192.168.1.38): icmp_seq=2 ttl=64 time=0.190 ms

--- node38.cluster ping statistics ---
2 packets transmitted, 2 received, 0% loss, time 999ms
rtt min/avg/max/mdev = 0.190/0.191/0.193/0.013 ms
but logging in via SSH does not work. On spraid:
[mdias@spraid mdias]$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda2              2063536    599940   1358772  31% /
none                   1027720         0   1027720   0% /dev/shm
/dev/sda7              1035660     34728    948324   4% /tmp
/dev/sda5             10317828   2196156   7597556  23% /usr
/dev/sda8             15346304   1444488  13122264  10% /usr/local
/dev/sda6              2063504    413860   1544824  22% /var
/dev/sdb1            1833096736  92955700 1647025088   6% /raid0
/dev/sdc1            1833096736 963934088 776046700  56% /raid1
/dev/sdd1            1730092600 264919452 1377289568  17% /raid2
/dev/sde1            1730092600 225076752 1417132268  14% /raid3
/dev/sdf1            1730092600 208326584 1433882436  13% /raid4
/dev/sdg1            1730092600 220788532 1421420488  14% /raid5
spdc00:/pnfsdoors       400000     80000    284000  22% /pnfs/if.usp.br
[mdias@spraid mdias]$ free
             total       used       free     shared    buffers     cached
Mem:       2055440    2038452      16988          0     581612    1175224
-/+ buffers/cache:     281616    1773824
Swap:      4192956      12724    4180232
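The symptom noted above (nodes answer ping but refuse SSH logins) can be checked across several nodes at once. The sketch below is an illustration only and was not run during this incident; `classify_node` and `check_node` are hypothetical helper names, and the timeouts are arbitrary choices.

```shell
#!/bin/bash
# Hypothetical helper: turn the exit codes of ping and ssh into a label.
#   $1 = ping exit code, $2 = ssh exit code (0 means success)
classify_node() {
    local ping_rc=$1 ssh_rc=$2
    if [ "$ping_rc" -eq 0 ] && [ "$ssh_rc" -eq 0 ]; then
        echo "ok"
    elif [ "$ping_rc" -eq 0 ]; then
        echo "ping-only"      # reachable on the network, but no SSH login
    else
        echo "down"
    fi
}

# Hypothetical helper: probe one host with ping and a non-interactive ssh.
check_node() {
    local host=$1 ping_rc ssh_rc
    ping -c 2 -W 2 "$host" >/dev/null 2>&1
    ping_rc=$?
    # BatchMode avoids hanging on a password prompt.
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true >/dev/null 2>&1
    ssh_rc=$?
    echo "$host $(classify_node "$ping_rc" "$ssh_rc")"
}

# Example (the node list is an assumption):
#   for n in node01 node02 node38; do check_node "$n"; done
```

A node in the state described in this entry would be reported as `ping-only`.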

We are also shown as down in PhEDEx: http://cms-project-phedex.web.cern.ch/cms-project-phedex/cgi-bin/browser.

Updates

11:05. I did not touch anything, and we are OK again in PhEDEx and Ganglia, but the cluster is extremely unstable, with some nodes dropping out from time to time. The logs look fine. spg00 is "pingable" but has reached peak utilization. I am logged in on spraid:

[mdias@spraid mdias]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             2.0G  586M  1.3G  31% /
none                 1004M     0 1004M   0% /dev/shm
/dev/sda7            1012M   34M  927M   4% /tmp
/dev/sda5             9.9G  2.1G  7.3G  23% /usr
/dev/sda8              15G  1.4G   13G  10% /usr/local
/dev/sda6             2.0G  405M  1.5G  22% /var
/dev/sdb1             1.8T   89G  1.6T   6% /raid0
/dev/sdc1             1.8T  920G  741G  56% /raid1
/dev/sdd1             1.7T  250G  1.3T  16% /raid2
/dev/sde1             1.7T  212G  1.4T  14% /raid3
/dev/sdf1             1.7T  198G  1.4T  13% /raid4
/dev/sdg1             1.7T  210G  1.4T  14% /raid5
spdc00:/pnfsdoors     391M   79M  278M  22% /pnfs/if.usp.br
[mdias@spraid mdias]$ free
             total       used       free     shared    buffers     cached
Mem:       2055440    2038660      16780          0     579184    1174588
-/+ buffers/cache:     284888    1770552
Swap:      4192956      12684    4180272
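A note on reading the `free` output: the "used" figure on the `Mem:` line includes buffers and page cache, so the machine is not nearly as full as it first appears. Subtracting buffers and cached from used gives the `-/+ buffers/cache` value (2038660 - 579184 - 1174588 = 284888 kB here). A minimal sketch of that arithmetic follows; `app_used_kb` is a hypothetical helper name, and it assumes the older `free` column layout shown in this log.

```shell
#!/bin/bash
# Hypothetical helper: memory used by applications, in kB.
# Assumes the older free(1) layout shown in this log:
#   Mem: total used free shared buffers cached
# so used = $3, buffers = $6, cached = $7.
app_used_kb() {
    awk '/^Mem:/ { print $3 - $6 - $7 }'
}

# Example:
#   free | app_used_kb
```

Newer versions of `free` use different columns, so the field numbers would need adjusting there.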