calamine读取xlsx文件的方法比较

发布于:2025-08-14 ⋅ 阅读:(11) ⋅ 点赞:(0)

calamine是纯rust编写的电子表格文件读取库,支持Excel(xls, xlsx, xlsm, xlsb, xla, xlam)、opendocument电子表格(ods)。
它自带的示例程序excel_to_csv.rs采用workbook.worksheet_range来读取,很多AI和网上的介绍文章都说,处理大文件workbook.worksheet_range_ref更快,我实际测试了一下。

测试步骤
0.准备测试xlsx文件
在duckdb中生成tpch数据,用excel插件写入xlsx文件。

duckdb
load tpch;
load excel;
call dbgen(sf=1);
copy (from lineitem limit 1048575)to 'exli2.xlsx' WITH (FORMAT xlsx, HEADER true) ;

1.拉取calamine源码

git clone --depth=1 https://gitclone.com/github.com/tafia/calamine
Cloning into 'calamine'...
remote: 对象计数中: 133, 完成.
remote: 压缩对象中: 100% (109/109), 完成.
remote: Total 133 (delta 21), reused 133 (delta 21)
Receiving objects: 100% (133/133), 5.08 MiB | 48.00 KiB/s, done.
Resolving deltas: 100% (21/21), done.

2.编译执行

export PATH=/par/mold240/bin:$PATH

cd /par/calamine
cargo build --release --example excel_to_csv

time /par/calamine/target/release/examples/excel_to_csv /par/exli2.xlsx Sheet1

real	0m21.959s
user	0m19.332s
sys	0m2.180s

3.修改excel_to_csv.rs为excel_to_csv_ref.rs
基本上就是把函数worksheet_range改为worksheet_range_ref,加入use calamine::ReaderRef;和use calamine::DataRef;数据类型Data修改为DataRef<'>,再根据编译器提示加入DataRef::<'>::SharedString的处理。

// SPDX-License-Identifier: MIT
//
// Copyright 2016-2025, Johann Tuffe.

//! An example for using the `calamine` crate to convert an Excel file to CSV.
//!
//! Converts XLSX, XLSM, XLSB, and XLS files. The filename and sheet name must
//! be specified as command line arguments. The output CSV will be written to a
//! file with the same name as the input file, but with a `.csv` extension.

use std::env;
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::PathBuf;

use calamine::{open_workbook_auto, Data, Range, Reader};
use calamine::ReaderRef;
use calamine::DataRef;
// usage: cargo run --example excel_to_csv file.xls[xmb] sheet_name
//
// Where:
// - `file.xls[xmb]` is the Excel file to convert. Required.
// - `sheet_name` is the name of the sheet to convert. Required.
//
// The output will be written to a file with the same name as the input file,
// including the path, but with a `.csv` extension.
//
fn main() {
    let excel_file = env::args()
        .nth(1)
        .expect("Please provide an excel file to convert");

    let sheet_name = env::args()
        .nth(2)
        .expect("Expecting a sheet name as second argument");

    let excel_path = PathBuf::from(excel_file);
    match excel_path.extension().and_then(|s| s.to_str()) {
        Some("xlsx") | Some("xlsm") | Some("xlsb") | Some("xls") => (),
        _ => panic!("Expecting an excel file"),
    }

    let csv_path = excel_path.with_extension("csv");
    let mut csv_file = BufWriter::new(File::create(csv_path).unwrap());
    let mut workbook = open_workbook_auto(&excel_path).unwrap();
    let range = workbook.worksheet_range_ref(&sheet_name).unwrap();

    write_to_csv(&mut csv_file, &range).unwrap();
}

// Write the Excel data as strings to a CSV file. Uses a semicolon (`;`) as the
// field separator.
//
// Note, this is a simplified version of CSV and doesn't handle quoting of
// separators or other special cases. See the `csv.rs` crate for a more robust
// solution.
fn write_to_csv<W: Write>(output_file: &mut W, range: &Range<DataRef<'_>>) -> std::io::Result<()> {
    let max_column = range.get_size().1 - 1;

    for rows in range.rows() {
        for (col_number, cell_data) in rows.iter().enumerate() {
            match *cell_data {
                DataRef::<'_>::Empty => Ok(()),
                DataRef::<'_>::Int(ref i) => write!(output_file, "{i}"),
                DataRef::<'_>::Bool(ref b) => write!(output_file, "{b}"),
                DataRef::<'_>::Error(ref e) => write!(output_file, "{e:?}"),
                DataRef::<'_>::Float(ref f) => write!(output_file, "{f}"),
                DataRef::<'_>::DateTime(ref d) => write!(output_file, "{}", d.as_f64()),
                DataRef::<'_>::String(ref s) | DataRef::<'_>::DateTimeIso(ref s) | DataRef::<'_>::DurationIso(ref s) => {
                    write!(output_file, "{s}")
                } , DataRef::<'_>::SharedString(s) => write!(output_file, "{s}")
            }?;

            // Write the field separator except for the last column.
            if col_number != max_column {
                write!(output_file, ";")?;
            }
        }

        write!(output_file, "\r\n")?;
    }

    Ok(())
}

4.编译执行

cargo build --release --example excel_to_csv_ref

mv /par/exli2.csv /par/exli2.csv.old
time /par/calamine/target/release/examples/excel_to_csv_ref /par/exli2.xlsx Sheet1

real	0m22.062s
user	0m19.412s
sys	0m2.592s
md5sum /par/exli2.csv*
05bcc52a09f1ee2bbb7c9b08172fb7ae  /par/exli2.csv
05bcc52a09f1ee2bbb7c9b08172fb7ae  /par/exli2.csv.old

结论:worksheet_range和worksheet_range_ref读取速度和读取结果完全一样。
polars和pandas的read_excel函数都支持使用calamine引擎,polars目前还是默认使用calamine引擎,测试结果如下

import polars as pl

pl.read_excel(source="exli2.xlsx",sheet_name="Sheet1",) 
shape: (1_048_575, 16)
┌────────────┬───────────┬───────────┬──────────────┬───┬───────────────┬───────────────────┬────────────┬─────────────────────────────────┐
│ l_orderkey ┆ l_partkey ┆ l_suppkey ┆ l_linenumber ┆ … ┆ l_receiptdate ┆ l_shipinstruct    ┆ l_shipmode ┆ l_comment                       │
│ ---        ┆ ---       ┆ ---       ┆ ---          ┆   ┆ ---           ┆ ---               ┆ ---        ┆ ---                             │
│ i64        ┆ i64       ┆ i64       ┆ i64          ┆   ┆ date          ┆ str               ┆ str        ┆ str                             │
╞════════════╪═══════════╪═══════════╪══════════════╪═══╪═══════════════╪═══════════════════╪════════════╪═════════════════════════════════╡
│ 115519077061            ┆ … ┆ 1996-03-22    ┆ DELIVER IN PERSON ┆ TRUCK      ┆ to beans x-ray carefull         │
│ 16731073112            ┆ … ┆ 1996-04-20    ┆ TAKE BACK RETURN  ┆ MAIL       ┆  according to the final foxes.… │
│ 16370037013            ┆ … ┆ 1996-01-31    ┆ TAKE BACK RETURN  ┆ REG AIR    ┆ ourts cajole above the furiou   │
│ 1213246334            ┆ … ┆ 1996-05-16    ┆ NONE              ┆ AIR        ┆ s cajole busily above t         │
│ 12402715345            ┆ … ┆ 1996-04-01    ┆ NONE              ┆ FOB        ┆  the regular, regular pa        │
│ …          ┆ …         ┆ …         ┆ …            ┆ … ┆ …             ┆ …                 ┆ …          ┆ …                               │
│ 10484826322482375            ┆ … ┆ 1992-05-17    ┆ NONE              ┆ MAIL       ┆  final accounts haggle about t… │
│ 10484838136013611            ┆ … ┆ 1994-08-30    ┆ NONE              ┆ AIR        ┆ ely around the regula           │
│ 10484831005250552            ┆ … ┆ 1994-07-08    ┆ DELIVER IN PERSON ┆ MAIL       ┆ of the bli                      │
│ 104848312950995103            ┆ … ┆ 1994-05-16    ┆ TAKE BACK RETURN  ┆ RAIL       ┆ ests impress. sly, even i       │
│ 10484835007025764            ┆ … ┆ 1994-08-29    ┆ DELIVER IN PERSON ┆ REG AIR    ┆ regularly expre                 │
└────────────┴───────────┴───────────┴──────────────┴───┴───────────────┴───────────────────┴────────────┴─────────────────────────────────┘
import time
t=time.time();x=pl.read_excel(source="exli2.xlsx",sheet_name="Sheet1",);print(time.time()-t)
23.807374477386475
t=time.time();x.write_csv("plexli2.csv", separator=",");print(time.time()-t)
0.6844663619995117
t=time.time();pl.read_excel(source="exli2.xlsx",sheet_name="Sheet1",).write_csv("plexli.csv", separator=",");print(time.time()-t)
22.80883002281189

import pandas as pd
import time
t=time.time();pd.read_excel("exli2.xlsx",engine="calamine").to_csv("pdexl2.csv", index=False, header=True);print(time.time()-t)
84.7169542312622

md5sum p*exl2.csv
814e86b96b4e15ae43bf005ca0f5c8fe  pdexl2.csv
md5sum plexli2.csv
814e86b96b4e15ae43bf005ca0f5c8fe  plexli2.csv

虽然通过python中转,polars与rust中直接转的时间差距微小,而pandas则慢了很多。


网站公告

今日签到

点亮在社区的每一天
去签到