
I have the following workstation configuration with 64 cores:

  • CPU: 4x AMD Opteron 6378, 2.4GHz (16 cores each)
  • RAM: 128GB
  • Motherboard: Supermicro H8QGI-F-O
  • SSD: 2x 512GB Samsung in software RAID-0

I am running Ubuntu Server 14.04 and get the error below with every Ubuntu Server kernel I have tried (3.13.0-32 up to 3.13.0-45). When I run a molecular dynamics simulation on more than 20 processors, the machine slows down significantly, to the point of freezing (error messages from /var/log/kern.log are posted below). A single instance of the program runs just fine.

The simulation package itself does not seem to be the problem: I have run 64 copies of it on different servers without trouble. I have also booted CentOS 7 and Ubuntu 12.04 from live CD on this machine and ran 64 instances of the code, and it never slowed down or froze. Ubuntu 12.04 with kernel 3.13.0-32 ran the software just fine from the live CD, but my Ubuntu 14.04 server installation always freezes.

Could this be caused by one of the modules loaded in the kernel? I have run memtest (no problems) and stressed the machine with 64 copies of CPUburn; both worked fine, so this seems to be a peculiar error.
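One way to narrow down the module question would be to compare the modules loaded on the installed 14.04 system with those loaded in a live session that does not freeze. The following is only a minimal sketch: it assumes the output of `lsmod` has been saved to a text file on each system, and the function names `parse_lsmod` and `diff_modules` as well as the file names are made up for illustration.

```python
import sys


def parse_lsmod(text):
    """Return the set of module names found in `lsmod` output."""
    names = set()
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header.
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


def diff_modules(installed_text, live_text):
    """Return (only_installed, only_live) as sorted lists of module names."""
    installed = parse_lsmod(installed_text)
    live = parse_lsmod(live_text)
    return sorted(installed - live), sorted(live - installed)


# Hypothetical usage: python3 diff_modules.py lsmod_installed.txt lsmod_live.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f_inst, open(sys.argv[2]) as f_live:
        only_inst, only_live = diff_modules(f_inst.read(), f_live.read())
    print("only on installed system:", " ".join(only_inst))
    print("only in live session:   ", " ".join(only_live))
```

Modules that appear only on the installed system would be the first candidates to blacklist one at a time, although a difference in the module list alone does not prove that a module is responsible.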

Jun 12 10:40:15 vochomurka kernel: [  233.746081] WARNING: CPU: 59 PID: 4337 at /build/buildd/linux-3.13.0/kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xd0()
Jun 12 10:40:15 vochomurka kernel: [  233.746084] Watchdog detected hard LOCKUP on cpu 59
Jun 12 10:40:15 vochomurka kernel: [  233.746086] Modules linked in: rfcomm bnep bluetooth binfmt_misc kvm_amd kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw amd64_edac_mod edac_core fam15h_power k10temp edac_mce_amd nvidia(POX) sp5100_tco i2c_piix4 drm shpchp joydev mac_hid parport_pc ppdev lp parport pata_acpi hid_generic usbhid hid raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath igb linear i2c_algo_bit psmouse dca ahci ptp pata_atiixp libahci pps_core
Jun 12 10:40:15 vochomurka kernel: [  233.746140] CPU: 59 PID: 4337 Comm: xargs Tainted: P           OX 3.13.0-45-generic #74-Ubuntu
Jun 12 10:40:15 vochomurka kernel: [  233.746143] Hardware name: Supermicro H8QG6/H8QG6, BIOS 3.5        12/16/2013
Jun 12 10:40:15 vochomurka kernel: [  233.746145]  0000000000000009 ffff882066d65c38 ffffffff81720eb6 ffff882066d65c80
Jun 12 10:40:15 vochomurka kernel: [  233.746174]  ffff882066d65c70 ffffffff810677cd ffff88203a840000 0000000000000000
Jun 12 10:40:15 vochomurka kernel: [  233.746187]  ffff882066d65d88 0000000000000000 ffff882066d65ef8 ffff882066d65cd0
Jun 12 10:40:15 vochomurka kernel: [  233.746201] Call Trace:
Jun 12 10:40:15 vochomurka kernel: [  233.746203]  <NMI>  [] dump_stack+0x45/0x56
Jun 12 10:40:15 vochomurka kernel: [  233.746220]  [] warn_slowpath_common+0x7d/0xa0
Jun 12 10:40:15 vochomurka kernel: [  233.746226]  [] warn_slowpath_fmt+0x4c/0x50
Jun 12 10:40:15 vochomurka kernel: [  233.746233]  [] ? restart_watchdog_hrtimer+0x50/0x50
Jun 12 10:40:15 vochomurka kernel: [  233.746239]  [] watchdog_overflow_callback+0x9c/0xd0
Jun 12 10:40:15 vochomurka kernel: [  233.746246]  [] __perf_event_overflow+0x8e/0x240
Jun 12 10:40:15 vochomurka kernel: [  233.746254]  [] ? ioremap_page_range+0x241/0x320
Jun 12 10:40:15 vochomurka kernel: [  233.746260]  [] perf_event_overflow+0x14/0x20
Jun 12 10:40:15 vochomurka kernel: [  233.746267]  [] x86_pmu_handle_irq+0x144/0x190
Jun 12 10:40:15 vochomurka kernel: [  233.746275]  [] ? unmap_kernel_range_noflush+0x11/0x20
Jun 12 10:40:15 vochomurka kernel: [  233.746282]  [] perf_event_nmi_handler+0x2b/0x50
Jun 12 10:40:15 vochomurka kernel: [  233.746288]  [] nmi_handle.isra.3+0x88/0x180
Jun 12 10:40:15 vochomurka kernel: [  233.746294]  [] do_nmi+0x169/0x340
Jun 12 10:40:15 vochomurka kernel: [  233.746299]  [] end_repeat_nmi+0x1e/0x2e
Jun 12 10:40:15 vochomurka kernel: [  233.746307]  [] ? __write_lock_failed+0x13/0x20
Jun 12 10:40:15 vochomurka kernel: [  233.746312]  [] ? __write_lock_failed+0x13/0x20
Jun 12 10:40:15 vochomurka kernel: [  233.746317]  [] ? __write_lock_failed+0x13/0x20
Jun 12 10:40:15 vochomurka kernel: [  233.746319]  <<EOE>>  [] _raw_write_lock_irq+0x1e/0x20
Jun 12 10:40:15 vochomurka kernel: [  233.746330]  [] do_exit+0x5a9/0xa50
Jun 12 10:40:15 vochomurka kernel: [  233.746336]  [] do_group_exit+0x3f/0xa0
Jun 12 10:40:15 vochomurka kernel: [  233.746341]  [] SyS_exit_group+0x14/0x20
Jun 12 10:40:15 vochomurka kernel: [  233.746348]  [] system_call_fastpath+0x1a/0x1f
Jun 12 10:40:15 vochomurka kernel: [  233.746350] ---[ end trace 04f618100e4ac70c ]---
Jun 12 10:40:29 vochomurka kernel: [  251.810867] pbs_sched[2739]: segfault at 0 ip 00007fc20f1927fc sp 00007fff726e1d50 error 4 in libtorque.so.2.0.0[7fc20f180000+2c000]
Jun 12 10:41:25 vochomurka kernel: [  312.822760] ------------[ cut here ]------------
Jun 12 10:41:25 vochomurka kernel: [  312.822775] WARNING: CPU: 59 PID: 4360 at /build/buildd/linux-3.13.0/kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xd0()
Jun 12 10:41:25 vochomurka kernel: [  312.822777] Watchdog detected hard LOCKUP on cpu 59
Jun 12 10:41:25 vochomurka kernel: [  312.822779] Modules linked in: rfcomm bnep bluetooth binfmt_misc kvm_amd kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw amd64_edac_mod edac_core fam15h_power k10temp edac_mce_amd nvidia(POX) sp5100_tco i2c_piix4 drm shpchp joydev mac_hid parport_pc ppdev lp parport pata_acpi hid_generic usbhid hid raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath igb linear i2c_algo_bit psmouse dca ahci ptp pata_atiixp libahci pps_core
Jun 12 10:41:25 vochomurka kernel: [  312.822832] CPU: 59 PID: 4360 Comm: pbs_iff Tainted: P        W  OX 3.13.0-45-generic #74-Ubuntu
Jun 12 10:41:25 vochomurka kernel: [  312.822834] Hardware name: Supermicro H8QG6/H8QG6, BIOS 3.5        12/16/2013
Jun 12 10:41:25 vochomurka kernel: [  312.822837]  0000000000000009 ffff882066d65c38 ffffffff81720eb6 ffff882066d65c80
Jun 12 10:41:25 vochomurka kernel: [  312.822870]  ffff882066d65c70 ffffffff810677cd ffff88203a840000 0000000000000000
Jun 12 10:41:25 vochomurka kernel: [  312.822893]  ffff882066d65d88 0000000000000000 ffff882066d65ef8 ffff882066d65cd0
Jun 12 10:41:25 vochomurka kernel: [  312.822911] Call Trace:
Jun 12 10:41:25 vochomurka kernel: [  312.822913]  <NMI>  [] dump_stack+0x45/0x56
Jun 12 10:41:25 vochomurka kernel: [  312.822931]  [] warn_slowpath_common+0x7d/0xa0
Jun 12 10:41:25 vochomurka kernel: [  312.822936]  [] warn_slowpath_fmt+0x4c/0x50
Jun 12 10:41:25 vochomurka kernel: [  312.822943]  [] ? restart_watchdog_hrtimer+0x50/0x50
Jun 12 10:41:25 vochomurka kernel: [  312.822949]  [] watchdog_overflow_callback+0x9c/0xd0
Jun 12 10:41:25 vochomurka kernel: [  312.822956]  [] __perf_event_overflow+0x8e/0x240
Jun 12 10:41:25 vochomurka kernel: [  312.822964]  [] ? ioremap_page_range+0x241/0x320
Jun 12 10:41:25 vochomurka kernel: [  312.822970]  [] perf_event_overflow+0x14/0x20
Jun 12 10:41:25 vochomurka kernel: [  312.822978]  [] x86_pmu_handle_irq+0x144/0x190
Jun 12 10:41:25 vochomurka kernel: [  312.822985]  [] ? unmap_kernel_range_noflush+0x11/0x20
Jun 12 10:41:25 vochomurka kernel: [  312.822993]  [] perf_event_nmi_handler+0x2b/0x50
Jun 12 10:41:25 vochomurka kernel: [  312.822998]  [] nmi_handle.isra.3+0x88/0x180
Jun 12 10:41:25 vochomurka kernel: [  312.823004]  [] do_nmi+0xd0/0x340
Jun 12 10:41:25 vochomurka kernel: [  312.823009]  [] end_repeat_nmi+0x1e/0x2e
Jun 12 10:41:25 vochomurka kernel: [  312.823017]  [] ? kzfree+0x2d/0x30
Jun 12 10:41:25 vochomurka kernel: [  312.823024]  [] ? __write_lock_failed+0x13/0x20
Jun 12 10:41:25 vochomurka kernel: [  312.823030]  [] ? __write_lock_failed+0x13/0x20
Jun 12 10:41:25 vochomurka kernel: [  312.823035]  [] ? __write_lock_failed+0x13/0x20
Jun 12 10:41:25 vochomurka kernel: [  312.823037]  <<EOE>>  [] _raw_write_lock_irq+0x1e/0x20
Jun 12 10:41:25 vochomurka kernel: [  312.823048]  [] do_exit+0x30b/0xa50
Jun 12 10:41:25 vochomurka kernel: [  312.823053]  [] do_group_exit+0x3f/0xa0
Jun 12 10:41:25 vochomurka kernel: [  312.823059]  [] SyS_exit_group+0x14/0x20
Jun 12 10:41:25 vochomurka kernel: [  312.823065]  [] system_call_fastpath+0x1a/0x1f
Jun 12 10:41:25 vochomurka kernel: [  312.823067] ---[ end trace 04f618100e4ac70d ]---
Jun 12 10:41:25 vochomurka kernel: [  312.823071] perf samples too long (4775 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
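
Since the full log contains many of these reports, a small script could tally which CPUs and which processes were involved in each hard lockup. This is a rough sketch that relies only on the two line formats visible in the excerpt above ("Watchdog detected hard LOCKUP on cpu N" and "CPU: N PID: N Comm: name"); the function name `summarize_lockups` is made up for illustration, and the log path is passed on the command line.

```python
import re
import sys
from collections import Counter

# Patterns taken from the log lines shown above.
LOCKUP_RE = re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)")
COMM_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")


def summarize_lockups(lines):
    """Return (lockups per CPU, list of (cpu, pid, comm)) for hard lockups."""
    lockups = Counter()
    tasks = []
    pending = False  # True between a LOCKUP line and its "Comm:" line
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            lockups[int(m.group(1))] += 1
            pending = True
            continue
        m = COMM_RE.search(line)
        if m and pending:
            tasks.append((int(m.group(1)), int(m.group(2)), m.group(3)))
            pending = False
    return lockups, tasks


# Usage: python3 lockups.py /var/log/kern.log
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], errors="replace") as log:
        per_cpu, tasks = summarize_lockups(log)
    for cpu, count in sorted(per_cpu.items()):
        print(f"cpu {cpu}: {count} hard lockup(s)")
    for cpu, pid, comm in tasks:
        print(f"  cpu {cpu} pid {pid} comm {comm}")
```

For the two reports pasted above, this would count two lockups on CPU 59, attributed to `xargs` (PID 4337) and `pbs_iff` (PID 4360).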