I have the following workstation configuration with 64 cores:
- CPU: 4x AMD Opteron 6378 2.4GHz
- RAM: 128GB
- mobo: Supermicro H8QGI-F-O
- SSD: 2 x 512GB Samsung with software RAID-0 setup
I am running Ubuntu Server 14.04, and I get the error below with every Ubuntu Server kernel I have tried (3.13.0-32 through 3.13.0-45). When I run a molecular dynamics simulation on more than 20 processors, the machine slows down significantly, to the point of freezing (the error messages from /var/log/kern.log are posted below). A single instance of the program runs just fine. The simulation package itself does not seem to be the problem: I have run 64 copies of it on different servers without any trouble. I have also booted CentOS 7 and Ubuntu 12.04 from a live CD on this machine and ran 64 instances of the code, and it never slowed down or froze. In particular, Ubuntu 12.04 booted from the live CD with kernel 3.13.0-32 ran the software just fine, whereas my Ubuntu 14.04 server installation always froze. Could this be caused by one of the modules loaded in the kernel?
I have tried memtest (no problems found) and also stressed the machine by running 64 copies of CPUburn. Everything worked fine, so this seems to be a peculiar error.
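To test the module hypothesis, one option would be to save the output of `lsmod` on the installed 14.04 system and on the live CD boot, then list the modules that are loaded only on the installation that freezes. The following is a minimal sketch of that comparison. The function names and the two file arguments are hypothetical, and it assumes the usual `lsmod` layout with the module name in the first column.

```python
import sys


def module_names(lsmod_text):
    """Return the set of module names found in `lsmod` output.

    Assumes the module name is the first whitespace-separated field
    and that the header line starts with "Module".
    """
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "Module":
            continue  # skip blank lines and the header
        names.add(fields[0])
    return names


def only_in_first(first_text, second_text):
    """Modules present in the first listing but absent from the second."""
    return sorted(module_names(first_text) - module_names(second_text))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python compare_modules.py installed.txt livecd.txt
    with open(sys.argv[1]) as installed, open(sys.argv[2]) as live:
        for name in only_in_first(installed.read(), live.read()):
            print(name)
```

Any module that shows up only on the installed system (for example the proprietary nvidia module, which is marked as tainting the kernel in the log) would be a candidate to blacklist for one test run.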
Jun 12 10:40:15 vochomurka kernel: [ 233.746081] WARNING: CPU: 59 PID: 4337 at /build/buildd/linux-3.13.0/kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xd0()
Jun 12 10:40:15 vochomurka kernel: [ 233.746084] Watchdog detected hard LOCKUP on cpu 59
Jun 12 10:40:15 vochomurka kernel: [ 233.746086] Modules linked in: rfcomm bnep bluetooth binfmt_misc kvm_amd kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw amd64_edac_mod edac_core fam15h_power k10temp edac_mce_amd nvidia(POX) sp5100_tco i2c_piix4 drm shpchp joydev mac_hid parport_pc ppdev lp parport pata_acpi hid_generic usbhid hid raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath igb linear i2c_algo_bit psmouse dca ahci ptp pata_atiixp libahci pps_core
Jun 12 10:40:15 vochomurka kernel: [ 233.746140] CPU: 59 PID: 4337 Comm: xargs Tainted: P OX 3.13.0-45-generic #74-Ubuntu
Jun 12 10:40:15 vochomurka kernel: [ 233.746143] Hardware name: Supermicro H8QG6/H8QG6, BIOS 3.5 12/16/2013
Jun 12 10:40:15 vochomurka kernel: [ 233.746145] 0000000000000009 ffff882066d65c38 ffffffff81720eb6 ffff882066d65c80
Jun 12 10:40:15 vochomurka kernel: [ 233.746174] ffff882066d65c70 ffffffff810677cd ffff88203a840000 0000000000000000
Jun 12 10:40:15 vochomurka kernel: [ 233.746187] ffff882066d65d88 0000000000000000 ffff882066d65ef8 ffff882066d65cd0
Jun 12 10:40:15 vochomurka kernel: [ 233.746201] Call Trace:
Jun 12 10:40:15 vochomurka kernel: [ 233.746203] [] dump_stack+0x45/0x56
Jun 12 10:40:15 vochomurka kernel: [ 233.746220] [] warn_slowpath_common+0x7d/0xa0
Jun 12 10:40:15 vochomurka kernel: [ 233.746226] [] warn_slowpath_fmt+0x4c/0x50
Jun 12 10:40:15 vochomurka kernel: [ 233.746233] [] ? restart_watchdog_hrtimer+0x50/0x50
Jun 12 10:40:15 vochomurka kernel: [ 233.746239] [] watchdog_overflow_callback+0x9c/0xd0
Jun 12 10:40:15 vochomurka kernel: [ 233.746246] [] __perf_event_overflow+0x8e/0x240
Jun 12 10:40:15 vochomurka kernel: [ 233.746254] [] ? ioremap_page_range+0x241/0x320
Jun 12 10:40:15 vochomurka kernel: [ 233.746260] [] perf_event_overflow+0x14/0x20
Jun 12 10:40:15 vochomurka kernel: [ 233.746267] [] x86_pmu_handle_irq+0x144/0x190
Jun 12 10:40:15 vochomurka kernel: [ 233.746275] [] ? unmap_kernel_range_noflush+0x11/0x20
Jun 12 10:40:15 vochomurka kernel: [ 233.746282] [] perf_event_nmi_handler+0x2b/0x50
Jun 12 10:40:15 vochomurka kernel: [ 233.746288] [] nmi_handle.isra.3+0x88/0x180
Jun 12 10:40:15 vochomurka kernel: [ 233.746294] [] do_nmi+0x169/0x340
Jun 12 10:40:15 vochomurka kernel: [ 233.746299] [] end_repeat_nmi+0x1e/0x2e
Jun 12 10:40:15 vochomurka kernel: [ 233.746307] [] ? __write_lock_failed+0x13/0x20
Jun 12 10:40:15 vochomurka kernel: [ 233.746312] [] ? __write_lock_failed+0x13/0x20
Jun 12 10:40:15 vochomurka kernel: [ 233.746317] [] ? __write_lock_failed+0x13/0x20
Jun 12 10:40:15 vochomurka kernel: [ 233.746319] > [] _raw_write_lock_irq+0x1e/0x20
Jun 12 10:40:15 vochomurka kernel: [ 233.746330] [] do_exit+0x5a9/0xa50
Jun 12 10:40:15 vochomurka kernel: [ 233.746336] [] do_group_exit+0x3f/0xa0
Jun 12 10:40:15 vochomurka kernel: [ 233.746341] [] SyS_exit_group+0x14/0x20
Jun 12 10:40:15 vochomurka kernel: [ 233.746348] [] system_call_fastpath+0x1a/0x1f
Jun 12 10:40:15 vochomurka kernel: [ 233.746350] ---[ end trace 04f618100e4ac70c ]---
Jun 12 10:40:29 vochomurka kernel: [ 251.810867] pbs_sched[2739]: segfault at 0 ip 00007fc20f1927fc sp 00007fff726e1d50 error 4 in libtorque.so.2.0.0[7fc20f180000+2c000]
Jun 12 10:41:25 vochomurka kernel: [ 312.822760] ------------[ cut here ]------------
Jun 12 10:41:25 vochomurka kernel: [ 312.822775] WARNING: CPU: 59 PID: 4360 at /build/buildd/linux-3.13.0/kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xd0()
Jun 12 10:41:25 vochomurka kernel: [ 312.822777] Watchdog detected hard LOCKUP on cpu 59
Jun 12 10:41:25 vochomurka kernel: [ 312.822779] Modules linked in: rfcomm bnep bluetooth binfmt_misc kvm_amd kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw amd64_edac_mod edac_core fam15h_power k10temp edac_mce_amd nvidia(POX) sp5100_tco i2c_piix4 drm shpchp joydev mac_hid parport_pc ppdev lp parport pata_acpi hid_generic usbhid hid raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath igb linear i2c_algo_bit psmouse dca ahci ptp pata_atiixp libahci pps_core
Jun 12 10:41:25 vochomurka kernel: [ 312.822832] CPU: 59 PID: 4360 Comm: pbs_iff Tainted: P W OX 3.13.0-45-generic #74-Ubuntu
Jun 12 10:41:25 vochomurka kernel: [ 312.822834] Hardware name: Supermicro H8QG6/H8QG6, BIOS 3.5 12/16/2013
Jun 12 10:41:25 vochomurka kernel: [ 312.822837] 0000000000000009 ffff882066d65c38 ffffffff81720eb6 ffff882066d65c80
Jun 12 10:41:25 vochomurka kernel: [ 312.822870] ffff882066d65c70 ffffffff810677cd ffff88203a840000 0000000000000000
Jun 12 10:41:25 vochomurka kernel: [ 312.822893] ffff882066d65d88 0000000000000000 ffff882066d65ef8 ffff882066d65cd0
Jun 12 10:41:25 vochomurka kernel: [ 312.822911] Call Trace:
Jun 12 10:41:25 vochomurka kernel: [ 312.822913] [] dump_stack+0x45/0x56
Jun 12 10:41:25 vochomurka kernel: [ 312.822931] [] warn_slowpath_common+0x7d/0xa0
Jun 12 10:41:25 vochomurka kernel: [ 312.822936] [] warn_slowpath_fmt+0x4c/0x50
Jun 12 10:41:25 vochomurka kernel: [ 312.822943] [] ? restart_watchdog_hrtimer+0x50/0x50
Jun 12 10:41:25 vochomurka kernel: [ 312.822949] [] watchdog_overflow_callback+0x9c/0xd0
Jun 12 10:41:25 vochomurka kernel: [ 312.822956] [] __perf_event_overflow+0x8e/0x240
Jun 12 10:41:25 vochomurka kernel: [ 312.822964] [] ? ioremap_page_range+0x241/0x320
Jun 12 10:41:25 vochomurka kernel: [ 312.822970] [] perf_event_overflow+0x14/0x20
Jun 12 10:41:25 vochomurka kernel: [ 312.822978] [] x86_pmu_handle_irq+0x144/0x190
Jun 12 10:41:25 vochomurka kernel: [ 312.822985] [] ? unmap_kernel_range_noflush+0x11/0x20
Jun 12 10:41:25 vochomurka kernel: [ 312.822993] [] perf_event_nmi_handler+0x2b/0x50
Jun 12 10:41:25 vochomurka kernel: [ 312.822998] [] nmi_handle.isra.3+0x88/0x180
Jun 12 10:41:25 vochomurka kernel: [ 312.823004] [] do_nmi+0xd0/0x340
Jun 12 10:41:25 vochomurka kernel: [ 312.823009] [] end_repeat_nmi+0x1e/0x2e
Jun 12 10:41:25 vochomurka kernel: [ 312.823017] [] ? kzfree+0x2d/0x30
Jun 12 10:41:25 vochomurka kernel: [ 312.823024] [] ? __write_lock_failed+0x13/0x20
Jun 12 10:41:25 vochomurka kernel: [ 312.823030] [] ? __write_lock_failed+0x13/0x20
Jun 12 10:41:25 vochomurka kernel: [ 312.823035] [] ? __write_lock_failed+0x13/0x20
Jun 12 10:41:25 vochomurka kernel: [ 312.823037] > [] _raw_write_lock_irq+0x1e/0x20
Jun 12 10:41:25 vochomurka kernel: [ 312.823048] [] do_exit+0x30b/0xa50
Jun 12 10:41:25 vochomurka kernel: [ 312.823053] [] do_group_exit+0x3f/0xa0
Jun 12 10:41:25 vochomurka kernel: [ 312.823059] [] SyS_exit_group+0x14/0x20
Jun 12 10:41:25 vochomurka kernel: [ 312.823065] [] system_call_fastpath+0x1a/0x1f
Jun 12 10:41:25 vochomurka kernel: [ 312.823067] ---[ end trace 04f618100e4ac70d ]---
Jun 12 10:41:25 vochomurka kernel: [ 312.823071] perf samples too long (4775 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
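Both traces above report the lockup on CPU 59, in a process that is exiting (do_exit) while waiting in _raw_write_lock_irq. To see whether that pattern holds across the whole log, the lockup reports could be tallied with a short script. This is only a sketch: the function name and the regular expressions are my own, written against the message format shown in this log, and may need adjusting for other kernel versions.

```python
import os
import re
import sys
from collections import Counter

# Patterns written against the 3.13 messages shown above (an assumption).
LOCKUP_RE = re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)")
COMM_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")


def summarize_lockups(log_text):
    """Count hard-lockup reports per CPU and per command name.

    Returns (cpus, comms): cpus maps CPU number -> number of lockup
    reports, comms maps process name -> number of trace headers.
    """
    cpus = Counter(int(m.group(1)) for m in LOCKUP_RE.finditer(log_text))
    comms = Counter(m.group(3) for m in COMM_RE.finditer(log_text))
    return cpus, comms


if __name__ == "__main__" and len(sys.argv) == 2 and os.path.isfile(sys.argv[1]):
    # Hypothetical usage: python lockups.py /var/log/kern.log
    with open(sys.argv[1], errors="replace") as log:
        cpus, comms = summarize_lockups(log.read())
    for cpu, count in sorted(cpus.items()):
        print("cpu %d: %d hard lockup report(s)" % (cpu, count))
    for comm, count in comms.most_common():
        print("%s: %d trace(s)" % (comm, count))
```

If the reports always land on the same CPU, that would point toward something specific to that core or socket; if they spread across CPUs under load, lock contention in the kernel seems more likely.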