Environment
- Build host: amd64 + Ubuntu 22.04
- Android source: Android 15 GKI
- Kernel version: Linux 6.6 (the crash log below reports 6.6.58-android15-8)
- Android build system: Bazel
- Toolchain: gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-
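Steps for locating a Linux kernel crash
When the Linux kernel crashes, it usually prints a stack dump. From that dump you can learn the approximate cause of the crash, the state of the system at the time, and what the kernel was executing.
Steps for tracking down the fault from the kernel crash log:
- Determine the kind of fault and where it occurred from the log
- Look up the symbol address in System.map
- Use the addr2line tool to map the address to a source location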
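The first step can be partly automated. The following is a minimal sketch, assuming an arm64 oops log in the format shown in this article; the function name parse_oops and the regular expression are illustrative and not part of any kernel tooling.

```python
import re

# Matches frames such as "isp_clk_gate_onoff+0x5c/0x204" (symbol+offset/size).
FRAME_RE = re.compile(r"([A-Za-z_][\w.]*)\+(0x[0-9a-f]+)/(0x[0-9a-f]+)")


def parse_oops(log_text):
    """Return (reason, pc_frame, call_trace) extracted from an oops log.

    Frames are tuples of (symbol, offset, size). This is a hypothetical
    helper written for this article's log format only.
    """
    reason = None
    pc = None
    trace = []
    in_trace = False
    for line in log_text.splitlines():
        if reason is None and "Unable to handle kernel" in line:
            reason = line[line.index("Unable to handle kernel"):].strip()
        m = FRAME_RE.search(line)
        frame = (m.group(1), int(m.group(2), 16), int(m.group(3), 16)) if m else None
        if "pc :" in line and frame:
            pc = frame
        elif "Call trace:" in line:
            in_trace = True
        elif in_trace:
            if frame:
                trace.append(frame)
            else:
                # The first line without a frame (e.g. "Code: ...") ends the trace.
                in_trace = False
    return reason, pc, trace
```

Feeding it the saved dmesg text returns the fault reason, the frame the PC was in, and the call trace as a list, which is convenient when many logs need to be compared.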
Example: locating where a Linux kernel crash occurred
Find the fault information in the log
[ 6.974145][ T1] arm,isp e8100000.isp: Adding to iommu group 11
[ 6.980371][ T1] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 // this line shows a NULL pointer dereference
[ 6.989848][ T1] Mem abort info:
[ 6.993331][ T1] ESR = 0x0000000096000005
[ 6.997772][ T1] EC = 0x25: DABT (current EL), IL = 32 bits
[ 7.003775][ T1] SET = 0, FnV = 0
[ 7.007521][ T1] EA = 0, S1PTW = 0
[ 7.011355][ T1] FSC = 0x05: level 1 translation fault
[ 7.016923][ T1] Data abort info:
[ 7.020495][ T1] ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
[ 7.026672][ T1] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 7.032416][ T1] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 7.038419][ T1] [0000000000000000] user address but active_mm is swapper
[ 7.045464][ T1] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
[ 7.052421][ T1] Modules linked in:
[ 7.056167][ T1] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.6.58-android15-8-maybe-dirty-4k-SE-SDK2P5 #1 1400000003000000474e55008fa9e0c15629191d
[ 7.069549][ T1] Hardware name: TI Davince Evaluation board (DT)
[ 7.076245][ T1] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 7.083896][ T1] pc : readl+0x38/0x80
[ 7.087819][ T1] lr : readl+0x38/0x80
[ 7.091738][ T1] sp : ffffffc0828eb7f0
[ 7.095743][ T1] x29: ffffffc0828eb7f0 x28: 0000000000000000 x27: 0000000000000000
[ 7.103570][ T1] x26: 0000000000000000 x25: 0000000000000000 x24: ffffff8c97335d70
[ 7.111396][ T1] x23: ffffff8eded5d8a8 x22: 0000000000000001 x21: ffffff8c97335d94
[ 7.119220][ T1] x20: ffffffc080cc3e64 x19: 0000000000000000 x18: ffffffc0828c50a0
[ 7.127046][ T1] x17: ffffffc0826e7a40 x16: ffffffc0826e7a70 x15: 001f00003fffffff
[ 7.134872][ T1] x14: 0000000000000901 x13: 2000000000000000 x12: 0000000000000008
[ 7.142697][ T1] x11: 000000000000002b x10: 0000000000000200 x9 : 0000000000000400
[ 7.150522][ T1] x8 : 0000000000000007 x7 : 6e69616d6f642d72 x6 : 0000000000000004
[ 7.158348][ T1] x5 : 0000000000005dc8 x4 : ffffffc08181b2a8 x3 : ffffffc080cc3e64
[ 7.166173][ T1] x2 : ffffffc080cc400c x1 : 0000000000000000 x0 : 0000000000000020
[ 7.173998][ T1] Call trace:
[ 7.177136][ T1] readl+0x38/0x80 // isp_clk_gate_onoff -> readl hit the NULL pointer
[ 7.180708][ T1] isp_clk_gate_onoff+0x5c/0x204 // the fault happened while isp_clk_gate_onoff() in the isp driver was running
[ 7.185495][ T1] isp_platform_probe+0x3ac/0x9f8
[ 7.190369][ T1] platform_probe+0xc0/0xec
[ 7.194724][ T1] really_probe+0x190/0x374
[ 7.199076][ T1] __driver_probe_device+0xa0/0x12c
[ 7.204122][ T1] driver_probe_device+0x3c/0x218
[ 7.208996][ T1] __driver_attach+0x110/0x1ec
[ 7.213608][ T1] bus_for_each_dev+0x104/0x160
[ 7.218310][ T1] driver_attach+0x24/0x34
[ 7.222576][ T1] bus_add_driver+0x154/0x270
[ 7.227104][ T1] driver_register+0x68/0x104
[ 7.231630][ T1] __platform_driver_probe+0x50/0xc8
[ 7.236764][ T1] fw_module_init+0x30/0x78
[ 7.241118][ T1] do_one_initcall+0xdc/0x360
[ 7.245645][ T1] do_initcall_level+0xc8/0x19c
[ 7.250347][ T1] do_initcalls+0x70/0xc0
[ 7.254527][ T1] do_basic_setup+0x1c/0x28
[ 7.258880][ T1] kernel_init_freeable+0xd0/0x138
[ 7.263841][ T1] kernel_init+0x20/0x1ac
[ 7.268022][ T1] ret_from_fork+0x10/0x20
[ 7.272290][ T1] Code: aa1303e1 aa1e03e3 aa1e03f4 97e989cd (b9400268)
[ 7.279072][ T1] ---[ end trace 0000000000000000 ]---
[ 7.287048][ T1] Kernel panic - not syncing: Oops: Fatal exception
[ 7.293483][ T1] SMP: stopping secondary CPUs
[ 7.298099][ T1] Kernel Offset: disabled
[ 7.302277][ T1] CPU features: 0x000002,c0000000,70020143,1001720b
Cause of the fault:
[ 6.980371][ T1] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 7.045464][ T1] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
These two lines show that the kernel crashed because of a NULL pointer dereference: it tried to access virtual address 0.
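The ESR value 0x96000005 in the log can also be decoded by hand to cross-check the "Mem abort info" lines. The sketch below assumes the standard ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0) and, for data aborts, WnR in ISS bit 6 and the fault status code in ISS bits 5:0; decode_esr is an illustrative name.

```python
def decode_esr(esr):
    """Split an arm64 ESR_ELx value into the fields printed in the oops.

    WnR and FSC are only meaningful when EC indicates a data abort.
    """
    iss = esr & 0x1FFFFFF           # ISS: bits 24:0
    return {
        "EC": (esr >> 26) & 0x3F,   # exception class: bits 31:26
        "IL": (esr >> 25) & 0x1,    # instruction length: 1 = 32-bit instruction
        "ISS": iss,
        "WnR": (iss >> 6) & 0x1,    # 0 = read access, 1 = write access
        "FSC": iss & 0x3F,          # fault status code
    }


fields = decode_esr(0x96000005)
print({name: hex(value) for name, value in fields.items()})
```

For this log the result is EC = 0x25, WnR = 0 and FSC = 0x05, which agrees with the "DABT (current EL)", "WnR = 0" and "level 1 translation fault" lines: a read from an unmapped address while running in the kernel.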
Location of the fault:
[ 7.083896][ T1] pc : readl+0x38/0x80
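This line identifies the operation that triggered the fault: the PC was inside readl, at offset 0x38.
Call stack at the time of the fault: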
[ 7.173998][ T1] Call trace:
[ 7.177136][ T1] readl+0x38/0x80
[ 7.180708][ T1] isp_clk_gate_onoff+0x5c/0x204
[ 7.185495][ T1] isp_platform_probe+0x3ac/0x9f8
[ 7.190369][ T1] platform_probe+0xc0/0xec
[ 7.194724][ T1] really_probe+0x190/0x374
[ 7.199076][ T1] __driver_probe_device+0xa0/0x12c
[ 7.204122][ T1] driver_probe_device+0x3c/0x218
[ 7.208996][ T1] __driver_attach+0x110/0x1ec
[ 7.213608][ T1] bus_for_each_dev+0x104/0x160
[ 7.218310][ T1] driver_attach+0x24/0x34
[ 7.222576][ T1] bus_add_driver+0x154/0x270
[ 7.227104][ T1] driver_register+0x68/0x104
[ 7.231630][ T1] __platform_driver_probe+0x50/0xc8
[ 7.236764][ T1] fw_module_init+0x30/0x78
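The call trace roughly tells you at what stage the fault happened. The log above shows that it occurred in the probe handling while the isp driver was being loaded. "isp_clk_gate_onoff+0x5c/0x204" narrows the location down further: it is at offset 0x5c from the address of the isp_clk_gate_onoff symbol, and 0x204 is the size of the isp_clk_gate_onoff code.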
Find the base address in the System.map symbol table
Look up isp_clk_gate_onoff in System.map to get the address of the symbol.
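The lookup and the base-plus-offset arithmetic can be sketched as follows. The System.map excerpt is made up for illustration: the base address of isp_clk_gate_onoff is an assumed value, chosen only so that adding 0x5c gives the address 0xffffffc080cc4928 that this article later passes to addr2line, and the neighbouring entry is invented. Real values must come from the System.map of your own build.

```python
# Hypothetical System.map excerpt: "<address> <type> <name>" per line.
SAMPLE_MAP = """\
ffffffc080cc48cc t isp_clk_gate_onoff
ffffffc080cc4ad0 t isp_example_next_symbol
"""


def load_system_map(text):
    """Parse System.map text into a {name: address} dictionary."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            # Static symbols can appear more than once; keep the first entry.
            symbols.setdefault(parts[2], int(parts[0], 16))
    return symbols


def resolve(symbols, frame):
    """Turn 'symbol+0xOFF/0xSIZE' from the call trace into an absolute address."""
    name, rest = frame.split("+", 1)
    offset = int(rest.split("/")[0], 16)
    return symbols[name] + offset


symbols = load_system_map(SAMPLE_MAP)
print(hex(resolve(symbols, "isp_clk_gate_onoff+0x5c/0x204")))
```

Because the log reports "Kernel Offset: disabled", the addresses in System.map can be used directly without subtracting a KASLR offset.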
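Use the addr2line tool to find the source location
llvm-addr2line is used here to map the address to a position in the source code. The reason for not using aarch64-none-linux-gnu-addr2line is explained in the problems section at the end.
step1. Put llvm-addr2line on the PATH
export PATH=/data/yuxi/xx-builder/src/android-gki/prebuilts/clang/host/linux-x86/llvm-binutils-stable:$PATH
/data/yuxi/xx-builder/src/android-gki is my local Android 15 source directory; the LLVM tools are available in that tree under prebuilts/.
step2. Map the code address to a location in the source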
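As a rough sketch of this step, the address can be handed to llvm-addr2line from a script. The options used are -e (the ELF file to read), -f (print function names) and -i (also print inlined frames). The path "vmlinux" is a placeholder for the unstripped vmlinux from the same build as the crashing kernel, and the helper names are hypothetical.

```python
import subprocess


def addr2line_cmd(vmlinux, address, tool="llvm-addr2line"):
    """Build the argument list for resolving one address in vmlinux."""
    return [tool, "-e", vmlinux, "-f", "-i", hex(address)]


def run_addr2line(vmlinux, address):
    """Run the tool and return its output (function name and file:line)."""
    result = subprocess.run(
        addr2line_cmd(vmlinux, address),
        capture_output=True,
        text=True,
        check=True,  # raise if the tool fails, e.g. unreadable vmlinux
    )
    return result.stdout


# Address used elsewhere in this article.
print(" ".join(addr2line_cmd("vmlinux", 0xffffffc080cc4928)))
```

Calling run_addr2line requires the tool to be on PATH as set up in step1; addr2line_cmd on its own only shows the command that would be executed.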
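Disassemble the fault location with the objdump tool
Combining the disassembly with the crash log makes a deeper analysis of the problem possible.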
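Even without a full disassembly, the "Code:" line of the oops already contains the faulting instruction: the word in parentheses is the instruction at the PC. The sketch below decodes only one A64 encoding, the 32-bit "LDR Wt, [Xn, #imm]" form with an unsigned offset, which is enough for this log; the function names are illustrative, and a real disassembler should be used for anything else.

```python
import re


def faulting_word(code_line):
    """Return the parenthesised instruction word from an oops 'Code:' line."""
    m = re.search(r"\(([0-9a-f]{8})\)", code_line)
    return int(m.group(1), 16) if m else None


def decode_ldr_w_uimm(insn):
    """Decode 'LDR Wt, [Xn, #imm]' (32-bit load, unsigned offset).

    Returns None for any other encoding.
    """
    if (insn & 0xFFC00000) != 0xB9400000:
        return None
    imm = ((insn >> 10) & 0xFFF) * 4   # imm12 is scaled by the 4-byte access size
    rn = (insn >> 5) & 0x1F            # base register
    rt = insn & 0x1F                   # destination register
    base = "sp" if rn == 31 else f"x{rn}"
    dest = "wzr" if rt == 31 else f"w{rt}"
    return f"ldr {dest}, [{base}, #{imm}]"


word = faulting_word("Code: aa1303e1 aa1e03e3 aa1e03f4 97e989cd (b9400268)")
print(decode_ldr_w_uimm(word))
```

The decoded instruction loads through x19, and the register dump shows x19 = 0000000000000000, so readl was called with a NULL address. That points at the register base pointer that isp_clk_gate_onoff passes to readl as the thing to check.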
Problems encountered
1. aarch64-none-linux-gnu-addr2line: vmlinux: unable to initialize decompress status for section .debug_aranges
Command: aarch64-none-linux-gnu-addr2line -e vmlinux 0xffffffc080cc4928
Error output:
aarch64-none-linux-gnu-addr2line: vmlinux: unable to initialize decompress status for section .debug_aranges
aarch64-none-linux-gnu-addr2line: vmlinux: unable to initialize decompress status for section .debug_aranges
aarch64-none-linux-gnu-addr2line: vmlinux: file format not recognized
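Cause:
vmlinux is the statically linked executable produced by the Linux kernel build, normally in ELF format. From earlier Linux kernel experience I expected this file to be the raw, uncompressed kernel image. The error output, however, shows that its debug sections (such as .debug_aranges) are compressed, and the binutils in this aarch64-none-linux-gnu- toolchain cannot decompress them.
Solution:
Use the LLVM toolchain. It generally has better support for newer ELF features, and the Android 15 source tree already provides the LLVM tools.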