Parsing Word Documents in Java

Published: 2025-07-17

Contents

Reading paragraphs
Reading images
Reading table contents
Page numbers

Reading paragraphs

Reading paragraph content with Apache POI is straightforward. Here is a demo:

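import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class ReadParagraphs {
    public static void main(String[] args) {
        // try-with-resources closes both the stream and the document
        try (FileInputStream stream = new FileInputStream("parse/pages.docx");
             XWPFDocument document = new XWPFDocument(stream)) {
            // getParagraphs() returns the body paragraphs; text inside tables is not included
            for (XWPFParagraph paragraph : document.getParagraphs()) {
                System.out.println(paragraph.getText());
            }
        } catch (IOException e) {
            // FileNotFoundException is a subclass of IOException, so one catch covers both
            throw new RuntimeException(e);
        }
    }
}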

Reading images

Reading the images in a Word document is not hard either: get the XWPFPictureData objects, and each one gives you the image content as a byte array.


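import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;

public class ReadPictures {
    public static void main(String[] args) {
        // try-with-resources closes both the stream and the document
        try (FileInputStream stream = new FileInputStream("parse/pages.docx");
             XWPFDocument document = new XWPFDocument(stream)) {
            // getAllPictures() returns the image parts stored in the document
            for (XWPFPictureData pictureData : document.getAllPictures()) {
                byte[] data = pictureData.getData();
                // getFileName() is the image's name inside the document;
                // the file is written to the current working directory
                Files.write(Paths.get(pictureData.getFileName()), data);
            }
        } catch (IOException e) {
            // FileNotFoundException is a subclass of IOException, so one catch covers both
            throw new RuntimeException(e);
        }
    }
}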

Reading table contents

A table in Word is a three-level structure: XWPFTable, XWPFTableRow, XWPFTableCell. With that structure in mind, the code to read a table is easy to write.

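import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public class ReadTables {
    public static void main(String[] args) {
        // try-with-resources closes both the stream and the document
        try (FileInputStream stream = new FileInputStream("parse/table.docx");
             XWPFDocument document = new XWPFDocument(stream)) {
            // Walk the three levels: table, then row, then cell
            for (XWPFTable table : document.getTables()) {
                for (XWPFTableRow row : table.getRows()) {
                    // Print the cells of one row on one line, separated by tabs
                    for (XWPFTableCell cell : row.getTableCells()) {
                        System.out.print(cell.getText());
                        System.out.print("\t");
                    }
                    System.out.println();
                }
            }
        } catch (IOException e) {
            // FileNotFoundException is a subclass of IOException, so one catch covers both
            throw new RuntimeException(e);
        }
    }
}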

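Page numbers

In practice, parsing Word documents is less common than generating them. But what if a requirement asks for the content of one specific page, say page 9? That is very hard to implement, because Apache POI only reads the underlying XML model of the Word document, and the actual page numbers are only known once the document has been rendered.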
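To make "the underlying XML model" concrete, here is a minimal sketch that uses only the JDK (no POI) to look at what a .docx actually stores. It assumes the main document part sits at the conventional path word/document.xml inside the ZIP package. The class name PageBreakHints and its methods are hypothetical helpers written for this illustration. The sketch counts two kinds of markers: explicit page breaks (`<w:br w:type="page"/>`) and `<w:lastRenderedPageBreak/>` elements, which Word may write on save to record where its last layout pass broke pages. The latter are only hints: they can be missing or stale, and other producers may not write them at all, so this does not give reliable page numbers.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Apache POI.
class PageBreakHints {
    // Main WordprocessingML namespace (the "w:" prefix in document.xml)
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Parses raw XML bytes with namespace support and DOCTYPE declarations disabled.
    static Document parseXml(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        return factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
    }

    // Finds word/document.xml (assumed location) in the .docx ZIP; returns null if absent.
    static Document readDocumentXml(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // readAllBytes() reads up to the end of the current ZIP entry
                    return parseXml(zip.readAllBytes());
                }
            }
        }
        return null;
    }

    // Counts <w:lastRenderedPageBreak/> hints left by the last application that laid out the file.
    static int countRenderedBreaks(Document xml) {
        return xml.getElementsByTagNameNS(W_NS, "lastRenderedPageBreak").getLength();
    }

    // Counts explicit page breaks, i.e. <w:br> elements whose w:type attribute is "page".
    static int countExplicitBreaks(Document xml) {
        NodeList breaks = xml.getElementsByTagNameNS(W_NS, "br");
        int count = 0;
        for (int i = 0; i < breaks.getLength(); i++) {
            Element br = (Element) breaks.item(i);
            if ("page".equals(br.getAttributeNS(W_NS, "type"))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java PageBreakHints path/to/file.docx
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            Document xml = readDocumentXml(in);
            if (xml == null) {
                System.out.println("word/document.xml not found");
                return;
            }
            System.out.println("lastRenderedPageBreak hints: " + countRenderedBreaks(xml));
            System.out.println("explicit page breaks: " + countExplicitBreaks(xml));
        }
    }
}
```

The two counts are reported separately on purpose: an explicit page break is often followed by a lastRenderedPageBreak hint for the same page boundary, so adding them together would double-count. If exact page content is required, the document still has to go through a real layout engine.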
