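Small talk

The environment crashed again. A fragile environment, a fragile domestic product.

Symptoms

This time the machine did not hang. It still responded to interrupts, and the login screen still reacted to mouse and keyboard input, but logging in was impossible, and switching to a text console did not work either. With no other option, I rebooted. After the reboot, messages contained a record of a Synchronous External Abort: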

[   74.645343] Unhandled fault: synchronous external abort (0x92000210) at 0x0000007f925deec8

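Since there was no kernel panic, the fault must have been raised from user space. Core dump collection had already been configured on this machine, so a core file was captured. Loading it in gdb prints the following: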

Core was generated by `/usr/lib/systemd/systemd --switched-root --system --deserialize 21'.
Program terminated with signal SIGBUS, Bus error.
#0  0x0000007f7ded55c8 in kill () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install systemd-219-13.nsd6.aarch64
(gdb) bt
#0  0x0000007f7ded55c8 in kill () from /lib64/libc.so.6
#1  0x0000005590aaae38 in crash.lto_priv ()
#2  <signal handler called>
#3  0x0000007f7df1a5a8 in malloc_consolidate () from /lib64/libc.so.6
#4  0x0000007f7df1d13c in _int_malloc () from /lib64/libc.so.6
#5  0x0000007f7df1f310 in malloc () from /lib64/libc.so.6
#6  0x0000007f7dee3e48 in realpath@@GLIBC_2.17 () from /lib64/libc.so.6
#7  0x0000007f7dd77c5c in canonicalize_path () from /lib64/libmount.so.1
#8  0x0000007f7dd5af30 in canonicalize_path_and_cache () from /lib64/libmount.so.1
#9  0x0000007f7dd6628c in mnt_table_parse_stream () from /lib64/libmount.so.1
#10 0x0000007f7dd6653c in mnt_table_parse_file () from /lib64/libmount.so.1
#11 0x0000007f7dd66af8 in __mnt_table_parse_mtab () from /lib64/libmount.so.1
#12 0x0000005590a84598 in mount_load_proc_self_mountinfo ()
#13 0x0000005590a886b8 in mount_dispatch_io ()
#14 0x0000005590b22ad4 in source_dispatch.lto_priv ()
#15 0x0000005590ab791c in sd_event_dispatch ()
#16 0x0000005590b34aa4 in manager_loop ()
#17 0x0000005590a7a69c in main ()

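Clearly the fault occurred in the context of the systemd process, so no wonder login stopped working. The faulting access happened inside the malloc path (malloc_consolidate, frame #3).

Analysis

The most important clue is still the fault code, 0x92000210, which is the exception syndrome (ESR) value the kernel prints in the "Unhandled fault" line. According to the ARMv8 manual, the top six bits (bits [31:26], the EC field) correspond to: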

100100 Data Abort from a lower Exception level

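That is: a data abort taken from a lower exception level.

In plain words: a data abort that came from user space.

Plainer still: in user space, the CPU hit an abort while accessing (reading or writing) memory.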
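The EC extraction described above can be sketched in a few lines of Python. This is only an illustration: the names `exception_class` and `EC_NAMES` are hypothetical helpers made up for this post, and the lookup table is deliberately limited to the four abort-related encodings rather than the full list in the manual.

```python
# Sketch: extract the Exception Class (EC) field from the logged syndrome value.
# The helper name and the table below are illustrative, not taken from any library.

ESR = 0x92000210  # value from the "Unhandled fault" kernel log line

# Deliberately incomplete table: only the abort-related EC encodings.
EC_NAMES = {
    0b100000: "Instruction Abort from a lower Exception level",
    0b100001: "Instruction Abort taken without a change in Exception level",
    0b100100: "Data Abort from a lower Exception level",
    0b100101: "Data Abort taken without a change in Exception level",
}


def exception_class(esr: int) -> int:
    """Return bits [31:26] of the syndrome value, i.e. the EC field."""
    return (esr >> 26) & 0x3F


ec = exception_class(ESR)
print(f"EC = {ec:06b}: {EC_NAMES.get(ec, 'not in this table')}")
```

For the value in the log, the top six bits come out as 100100, matching the manual entry quoted above.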

Next, look at what the remaining bits of the code mean:

  • ISV, bit [24] Instruction syndrome valid. Indicates whether the syndrome information in ISS[23:14] is valid. 0 No valid instruction syndrome. ISS[23:14] are RES0. 1 ISS[23:14] hold a valid instruction syndrome.

    In this code the bit is 0, meaning no valid instruction syndrome is available.

  • EA, bit [9] External abort type. This bit can provide an IMPLEMENTATION DEFINED classification of external aborts. For any abort other than an External abort this bit returns a value of

    In this code the bit is 1, which indicates an External abort.

  • WnR, bit [6] Write not Read. Indicates whether a synchronous abort was caused by a write instruction or a read instruction. The possible values of this bit are: 0 Abort caused by a read instruction. 1 Abort caused by a write instruction. For faults on cache maintenance and address translation instructions, this bit always returns a value of 1. For an asynchronous Data Abort exception this bit is UNKNOWN.

    The bit is 0, meaning the abort was caused by a read.

  • DFSC, bits [5:0] Data Fault Status Code. Possible values of this field are:

The last six bits give the specific fault. In this code they are 010000, which the manual describes as:

010000 Synchronous external abort, other than synchronous parity or ECC error, not on translation table walk

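In other words: a synchronous external abort that is not a memory parity error, not an ECC error, and did not occur during a translation table walk. That is rather vague. The synchronous external aborts one usually sees probably fall into those few categories, and everything else is unknown, so does this code merely say "I don't know what kind of abort this is"? Heh, surely that is a joke. Probably only the hardware designers can explain this one.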
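To double-check the bit-by-bit reading above, here is a minimal Python sketch that slices the same fields out of the logged value. The function `decode_data_abort_iss` and the `DFSC_NAMES` table are hypothetical names invented for this post, and the table lists only two DFSC encodings, so treat it as a sanity check rather than a complete decoder.

```python
# Sketch: decode the data-abort fields discussed above from the logged value.
# Function and table names are illustrative assumptions, not an existing API.

# Deliberately incomplete: only two DFSC encodings from the manual.
DFSC_NAMES = {
    0b010000: "Synchronous external abort, not on translation table walk",
    0b011000: "Synchronous parity or ECC error on memory access, "
              "not on translation table walk",
}


def decode_data_abort_iss(esr: int) -> dict:
    """Slice out the ISV, EA, WnR and DFSC fields of a data-abort syndrome."""
    return {
        "ISV": (esr >> 24) & 1,   # bit [24]: instruction syndrome valid
        "EA": (esr >> 9) & 1,     # bit [9]: external abort type
        "WnR": (esr >> 6) & 1,    # bit [6]: 1 = write, 0 = read
        "DFSC": esr & 0x3F,       # bits [5:0]: data fault status code
    }


fields = decode_data_abort_iss(0x92000210)
for name, value in fields.items():
    print(f"{name} = {value:b}")
print(DFSC_NAMES.get(fields["DFSC"], "not in this table"))
```

Run on 0x92000210, this gives ISV = 0, EA = 1, WnR = 0 and DFSC = 010000, the same values read off by hand.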