Reading paragraphs
Reading paragraph text is straightforward. Here is a demo:
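import java.io.FileInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class ReadParagraphs {
    public static void main(String[] args) {
        // try-with-resources closes both the stream and the document.
        try (FileInputStream stream = new FileInputStream("parse/pages.docx");
             XWPFDocument document = new XWPFDocument(stream)) {
            // Print the text of each paragraph in the document body.
            for (XWPFParagraph paragraph : document.getParagraphs()) {
                System.out.println(paragraph.getText());
            }
        } catch (IOException e) {
            // FileNotFoundException is a subclass of IOException, so one catch covers both.
            throw new UncheckedIOException(e);
        }
    }
}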
Reading images
Reading the images in a Word document is not hard either: get the XWPFPictureData objects, and each one gives you the image content as a byte array.
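import java.io.FileInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;

public class ReadPictures {
    public static void main(String[] args) {
        // try-with-resources closes both the stream and the document.
        try (FileInputStream stream = new FileInputStream("parse/pages.docx");
             XWPFDocument document = new XWPFDocument(stream)) {
            for (XWPFPictureData pictureData : document.getAllPictures()) {
                // Raw image bytes, exactly as stored inside the .docx package.
                byte[] data = pictureData.getData();
                // Write each image to the working directory under its stored file name.
                Files.write(Paths.get(pictureData.getFileName()), data);
            }
        } catch (IOException e) {
            // FileNotFoundException is a subclass of IOException, so one catch covers both.
            throw new UncheckedIOException(e);
        }
    }
}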
Reading table content
A table in Word is a three-level structure: XWPFTable, XWPFTableRow, XWPFTableCell. With that structure in mind, the code to read a table is easy to write.
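import java.io.FileInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public class ReadTables {
    public static void main(String[] args) {
        // try-with-resources closes both the stream and the document.
        try (FileInputStream stream = new FileInputStream("parse/table.docx");
             XWPFDocument document = new XWPFDocument(stream)) {
            for (XWPFTable table : document.getTables()) {
                for (XWPFTableRow row : table.getRows()) {
                    // Print the cells of one row on one line, separated by tabs.
                    for (XWPFTableCell cell : row.getTableCells()) {
                        System.out.print(cell.getText());
                        System.out.print("\t");
                    }
                    System.out.println();
                }
            }
        } catch (IOException e) {
            // FileNotFoundException is a subclass of IOException, so one catch covers both.
            throw new UncheckedIOException(e);
        }
    }
}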
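The three levels mirror elements in the underlying WordprocessingML: XWPFTable wraps w:tbl, XWPFTableRow wraps w:tr, and XWPFTableCell wraps w:tc. To make that mapping visible, here is a minimal sketch that walks the same structure with only the JDK's XML parser. It assumes you have already opened the word/document.xml entry of the .docx (which is a zip archive) as an InputStream. The class TableXmlWalk and its method names are hypothetical helpers, not part of Apache POI, and the sketch only looks at tables that sit directly in the document body.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Apache POI.
final class TableXmlWalk {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns tables -> rows -> cell texts for tables that are direct children of w:body. */
    static List<List<List<String>>> readTables(InputStream documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element root = factory.newDocumentBuilder().parse(documentXml).getDocumentElement();
            List<List<List<String>>> tables = new ArrayList<>();
            for (Element body : children(root, "body")) {
                for (Element tbl : children(body, "tbl")) {          // level 1: table
                    List<List<String>> rows = new ArrayList<>();
                    for (Element tr : children(tbl, "tr")) {         // level 2: row
                        List<String> cells = new ArrayList<>();
                        for (Element tc : children(tr, "tc")) {      // level 3: cell
                            cells.add(text(tc));
                        }
                        rows.add(cells);
                    }
                    tables.add(rows);
                }
            }
            return tables;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document.xml", e);
        }
    }

    /** Direct child elements of parent in the w: namespace with the given local name. */
    private static List<Element> children(Element parent, String localName) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && W_NS.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                result.add((Element) n);
            }
        }
        return result;
    }

    /** Concatenates every w:t text run found anywhere inside the cell. */
    private static String text(Element cell) {
        StringBuilder sb = new StringBuilder();
        NodeList runs = cell.getElementsByTagNameNS(W_NS, "t");
        for (int i = 0; i < runs.getLength(); i++) {
            sb.append(runs.item(i).getTextContent());
        }
        return sb.toString();
    }
}
```

For real work the POI classes above remain the simpler choice; the sketch is only meant to show what they wrap.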
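Page numbers
In real-world work, you generate Word documents far more often than you parse them. But what if a requirement asks for the content of one specific page, say page 9? That is very hard to do, because Apache POI only reads the XML model underlying the Word file, and the actual page numbers are known only after the document is rendered.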
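One partial workaround is to rely on hints already present in the XML. Explicit page breaks appear as w:br elements with w:type="page", and Word itself records a w:lastRenderedPageBreak element wherever a page ended the last time it laid out and saved the file. The sketch below is a rough heuristic built on those two markers, using only the JDK. It assumes the main part is stored as word/document.xml, it ignores section breaks, and it collapses markers that have no text between them. The class PageBreakHints and its methods are hypothetical names, not Apache POI API. Files produced by tools other than Word may carry no w:lastRenderedPageBreak markers at all, and the markers go stale once the document is edited without being re-rendered, so treat the result as an approximation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper for illustration; not part of Apache POI.
final class PageBreakHints {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Splits the document text into approximate pages using break markers in the XML. */
    static List<String> splitByPageHints(InputStream documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Node root = factory.newDocumentBuilder().parse(documentXml).getDocumentElement();
            List<StringBuilder> pages = new ArrayList<>();
            pages.add(new StringBuilder());
            walk(root, pages);
            List<String> result = new ArrayList<>();
            for (StringBuilder page : pages) {
                result.add(page.toString().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document.xml", e);
        }
    }

    /** Depth-first walk in document order; the last builder in the list is the current page. */
    private static void walk(Node node, List<StringBuilder> pages) {
        boolean wml = node instanceof Element && W_NS.equals(node.getNamespaceURI());
        String name = wml ? node.getLocalName() : "";
        StringBuilder current = pages.get(pages.size() - 1);
        if (isPageBreak(node, name)) {
            // Assumption: a marker on a still-empty page duplicates the previous one.
            if (current.toString().trim().length() > 0) {
                pages.add(new StringBuilder());
            }
        } else if (name.equals("t")) {
            current.append(node.getTextContent());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, pages);
        }
        if (name.equals("p")) {
            pages.get(pages.size() - 1).append('\n'); // one line per paragraph
        }
    }

    private static boolean isPageBreak(Node node, String name) {
        if (name.equals("lastRenderedPageBreak")) {
            return true;
        }
        return name.equals("br") && "page".equals(((Element) node).getAttributeNS(W_NS, "type"));
    }

    /** Returns the approximate text of a 1-based page, or "" if the page is out of range. */
    static String readPage(String docxPath, int pageNumber) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath);
             InputStream xml = zip.getInputStream(zip.getEntry("word/document.xml"))) {
            List<String> pages = splitByPageHints(xml);
            return pageNumber >= 1 && pageNumber <= pages.size() ? pages.get(pageNumber - 1) : "";
        }
    }
}
```

With this sketch, PageBreakHints.readPage("parse/pages.docx", 9) would return whatever text falls between the eighth and ninth break markers. A reliable answer still requires a real layout engine to render the document.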