
Spatially Grounded Explanations in Vision-Language Models for Document Visual Question Answering

Maximiliano Hormazabal1,2, Héctor Cerezo-Costas2 and Dimosthenis Karatzas1

1 Computer Vision Center, Universitat Autonoma de Barcelona, Spain {mhormazabal,dimos}@cvc.uab.es
2 Gradiant, Vigo, Galicia, Spain hcerezo@gradiant.org

Abstract. We introduce EaGERS, a fully training-free and model-agnostic pipeline that (1) generates natural language rationales via a vision language model, (2) grounds these rationales to spatial sub-regions by computing multimodal embedding similarities over a configurable grid with majority voting, and (3) restricts the generation of responses only from the relevant regions selected in the masked image. Experiments on the DocVQA dataset demonstrate that our best configuration not only outperforms the base model on exact match accuracy and Average Normalized Levenshtein Similarity metrics but also enhances transparency and reproducibility in DocVQA without additional model fine-tuning. Code available at: https://github.com/maxhormazabal/EaGERS-DvQA

Keywords: Document Intelligence · Visual Question Answering · Multimodal Reasoning · Explainability

1 Introduction

Document Visual Question Answering (DocVQA) has advanced rapidly with Transformer-based methods that integrate OCR, layout modeling, and domain adaptation [1,7,4,14]. Concurrently, general-purpose Vision Language Models (VLMs) [19,11,8] achieve strong document understanding without explicit DocVQA training.

Deploying off-the-shelf vision-language models in enterprise pipelines often involves costly fine-tuning, unstable prompt engineering, and a lack of clear grounding between answers and source regions [21]. To address these challenges, we introduce Explanation-Guided Region Selection (EaGERS), a fully model-agnostic, training-free DocVQA pipeline that (i) generates natural language explanations, (ii) selects the top sub-regions over a configurable grid via multimodal embedding similarities and majority voting, and (iii) re-queries the model on a masked image so that answers derive solely from those validated regions, ensuring grounded and transparent inference.

The problem we address is: how to enforce that the answer can be reconstructed solely from document regions that are explicitly grounded and verbalised, without any additional training of the VLM.

The main contributions of this work are:

  1. EaGERS: A fully model-agnostic and training-free DocVQA pipeline capable of generating answers on masked document images using general-purpose multimodal models.

  2. Integration of text explanations and visual masking to contribute to the traceability and explainability of inferences.

2 Related Work

2.1 DocVQA

In recent years, DocVQA systems have achieved solid baselines on the DocVQA dataset [12]. Some approaches combine OCR and QA modules, such as LayoutLM [20] and TILT, yet still lag behind human performance. Other transformer-based models, such as DocFormer and Donut, adopt end-to-end, OCR-free architectures, and supervised-attention methods like M4C [5] integrate textual, positional, and visual cues to boost retrieval accuracy; however, their interpretability remains limited to attention-weight analysis. Today, state-of-the-art OCR-free approaches reach near-human accuracy but offer no natural language rationales. Meanwhile, multimodal compression frameworks like mPLUG-DocOwl 2.0 improve scalability but still lack explanations grounded in specific document regions.

2.2 Explainability in DocVQA

In the area of spatial explainability, methods such as DocXplain [17] apply ablation-based attribution, demonstrating higher fidelity than Grad-CAM [18] at the cost of multiple inferences. Hybrid models like DLaVA [13] combine textual answers with bounding boxes, achieving Intersection over Union (IoU) [16] above 0.5 on DocVQA, while MRVQA [10] introduces textual rationales alongside visual highlights and proposes specific visual-text coherence metrics.

2.3 Modal Alignment and Multimodal Embeddings

Modal alignment projects text and image representations into a shared latent space, allowing direct comparison via geometric metrics such as cosine similarity. Pretrained models like CLIP [15], ALIGN [6], and BLIP [9] employ a multimodal contrastive objective to bring semantically related pairs closer together. In our pipeline, we use embeddings from BLIP, CLIP, and ALIGN to vectorize both natural language explanations and document sub-regions, enabling the multimodal similarity measurements that drive masking and focused re-querying.
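
As an illustration of this alignment step (not the exact EaGERS implementation), the sketch below embeds an explanation string and a set of sub-region crops with a pretrained CLIP checkpoint from Hugging Face transformers and compares them by cosine similarity; the checkpoint name and function signature are illustrative choices.

```python
# Illustrative sketch: score document sub-regions against a textual explanation
# with CLIP embeddings and cosine similarity. Checkpoint and names are examples.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarities(explanation: str, crops: list[Image.Image]) -> torch.Tensor:
    """Return one cosine similarity per crop against the explanation text."""
    text_inputs = processor(text=[explanation], return_tensors="pt",
                            padding=True, truncation=True)
    image_inputs = processor(images=crops, return_tensors="pt")
    with torch.no_grad():
        text_emb = model.get_text_features(**text_inputs)      # shape (1, d)
        image_emb = model.get_image_features(**image_inputs)   # shape (n, d)
    # Normalize so the dot product equals cosine similarity.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).squeeze(-1)  # shape (n,)
```

Analogous calls with the BLIP and ALIGN checkpoints yield the two remaining similarity rankings used by the ensemble.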

3 Methodology

EaGERS-DocVQA leverages the knowledge of general-purpose multimodal models in document understanding without requiring dedicated training. Figure 1 shows the overall architecture of the proposed system, which consists of three main stages: A) Explanation generation, B) Region selection, and C) Answer generation. In our experiments, we use the Qwen2.5-VL-3B model as the core component; however, the proposed system is essentially model-agnostic.

Fig. 1. EaGERS Document VQA pipeline: (1) the multimodal model generates a spatial natural language explanation from the image and the question; (2) the image is segmented into an m × n grid; (3) embeddings of the explanation and each sub-region are obtained using BLIP, CLIP, and ALIGN; (4) majority voting selects the most relevant regions; (5) the image is masked to retain only those regions, and the model is re-queried with the question to generate the final answer.

3.1 Explanation Generation

The document image and question are passed to a vision-language model, which uses them to generate a natural language explanation of how to obtain the requested answer in visual terms. This explanation is not the final answer but serves as a guide to locate the relevant information in the image. We employ this inference in subsequent steps as a semantic tool for comparing the image sub-regions with the generated explanation. Although these spatial explanations typically exhibit good alignment with relevant regions, there are occasional cases where the generated explanations may inaccurately refer to irrelevant areas, potentially impacting subsequent region selection.
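
As a hypothetical illustration of this stage, the sketch below wraps a generic VLM call; query_vlm is a placeholder for whatever model is deployed (Qwen2.5-VL-3B in our experiments), and the prompt wording is an assumption rather than the exact prompt used.

```python
# Hypothetical sketch of the explanation-generation step. `query_vlm` is a
# placeholder callable for the deployed VLM; the prompt text is illustrative.
from typing import Callable
from PIL import Image

EXPLANATION_PROMPT = (
    "Question: {question}\n"
    "Do not answer the question yet. Instead, describe in natural language "
    "where in the document the information needed to answer it is located."
)

def generate_explanation(query_vlm: Callable[[Image.Image, str], str],
                         image: Image.Image, question: str) -> str:
    """Ask the VLM for a spatial rationale rather than the final answer."""
    return query_vlm(image, EXPLANATION_PROMPT.format(question=question))
```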

3.2 Region selection via similarity

The document image is divided into an m × n grid, yielding m · n sub-regions, each of which may or may not be relevant for obtaining the answer.
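
A minimal sketch of this segmentation, assuming PIL images; cell boundaries are rounded to integer pixels, and all names are illustrative.

```python
# Split a document image into a rows x cols grid of sub-region crops (row-major).
from PIL import Image

def split_into_grid(image: Image.Image, rows: int, cols: int):
    """Return a list of (box, crop) pairs covering the whole image."""
    width, height = image.size
    cells = []
    for r in range(rows):
        for c in range(cols):
            box = (round(c * width / cols), round(r * height / rows),
                   round((c + 1) * width / cols), round((r + 1) * height / rows))
            cells.append((box, image.crop(box)))
    return cells
```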

Each sub-region is converted into a vector representation so that it can be compared with the model’s spatial explanation. An ensemble of three multimodal models, BLIP, CLIP, and ALIGN, generates these embeddings in a complementary way in order to mitigate model-specific biases and improve robustness across heterogeneous document layouts.

For each embedder, once we obtain a cosine-similarity score between the explanation embedding and each sub-region embedding, we select the top k sub-regions, where k is 30% of the total number of sub-regions (rounded up). This value was set based on preliminary experiments, as it provided a good compromise between spatial granularity and reliable evidence localization. Although an adaptive grid might better handle irregular layouts, the fixed grid simplifies reproducibility and reduces complexity for this initial study.
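
In code, the per-embedder selection reduces to a top-k over the similarity vector, with k = ⌈0.3 · m · n⌉; the sketch below assumes the PyTorch similarity vector produced in the alignment step.

```python
# Keep the indices of the k highest-scoring sub-regions for one embedder.
import math
import torch

def top_k_regions(similarities: torch.Tensor, fraction: float = 0.3) -> list[int]:
    k = math.ceil(fraction * similarities.numel())
    return torch.topk(similarities, k).indices.tolist()
```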

The final ranking is determined by majority voting across the rankings from each embedder. In the event of ties during majority voting, we resolve these by prioritizing regions based on their average cosine similarity scores across all embedders, thus favoring sub-regions with more consistent overall relevance. The resulting list of selected sub-regions R defines the visible area used in the subsequent answer generation stage.
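
One way to realize this voting and tie-breaking scheme is sketched below; the precise aggregation details (e.g., how the mean similarities are computed) are illustrative assumptions rather than the exact implementation.

```python
# Aggregate top-k votes from BLIP, CLIP, and ALIGN; break ties with the mean
# cosine similarity of each region across the three embedders.
from collections import Counter

def majority_vote(votes_per_embedder: list[list[int]],
                  mean_similarity: dict[int, float], k: int) -> list[int]:
    counts = Counter(idx for votes in votes_per_embedder for idx in votes)
    # Sort by vote count first, then by average similarity to resolve ties.
    ranked = sorted(counts,
                    key=lambda idx: (counts[idx], mean_similarity[idx]),
                    reverse=True)
    return ranked[:k]  # selected sub-region indices R
```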

3.3 Masking and re-query

We create a masked version of the original image in which all grid cells outside R are filled with black. The question and the masked image are then reintroduced to the same multimodal model, which must generate the answer using only the information within the justified regions, without access to the previously generated explanation except implicitly through the defined region mask.
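
The masking step can be sketched with PIL as follows, reusing the grid boxes from the segmentation step; the fill color and drawing approach are one straightforward choice.

```python
# Black out every grid cell whose index is not in the selected set R.
from PIL import Image, ImageDraw

def mask_outside_regions(image: Image.Image,
                         boxes: list[tuple[int, int, int, int]],
                         selected: set[int]) -> Image.Image:
    masked = image.copy()
    draw = ImageDraw.Draw(masked)
    for idx, box in enumerate(boxes):
        if idx not in selected:
            draw.rectangle(box, fill="black")
    return masked
```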

In the following section we evaluate how effectively this approach recovers ground-truth answers under different grid and margin configurations.

4 Experimentation

4.1 Datasets and Evaluation Protocols

In our experiments, we use the validation split of the DocVQA Single Page dataset. We applied resizing preprocessing (preserving the aspect ratio) to compress images and optimize inferences. For spatial partitioning, we divide each image into a uniform grid of 5 columns and 5 rows (25 cells) in an initial series of tests, and an alternative configuration of 5 columns and 10 rows (50 cells) to explore the impact of granularity on relevant-region selection. We also tested a 15% margin expansion of the unmasked sub-regions for both grid configurations, as sketched below.
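
The margin expansion can be implemented by growing each unmasked cell by a fraction of its own width and height before masking, clamped to the image borders; the sketch below is one reasonable formulation, not necessarily the exact rule used.

```python
# Expand a grid cell box by a relative margin (15% by default), keeping it
# inside the image. Coordinates may be rounded before cropping or drawing.
def expand_box(box, image_size, margin: float = 0.15):
    left, upper, right, lower = box
    width, height = image_size
    dx = (right - left) * margin
    dy = (lower - upper) * margin
    return (max(0, left - dx), max(0, upper - dy),
            min(width, right + dx), min(height, lower + dy))
```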

We use Exact Match (EM) and Average Normalized Levenshtein Similarity (ANLS) [2], which mitigates the impact of misrecognition errors by thresholding normalized edit distances.
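
For reference, a minimal sketch of ANLS under its standard definition (threshold 0.5), using the python-Levenshtein package; the lower-casing and whitespace stripping are normalization assumptions.

```python
# ANLS: per question, best thresholded normalized Levenshtein similarity over
# all ground-truth answers, averaged over the dataset.
import Levenshtein

def anls(predictions: list[str], ground_truths: list[list[str]],
         tau: float = 0.5) -> float:
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for gt in answers:
            p, g = pred.strip().lower(), gt.strip().lower()
            nl = Levenshtein.distance(p, g) / max(len(p), len(g), 1)
            if nl < tau:  # distances at or above the threshold score zero
                best = max(best, 1.0 - nl)
        scores.append(best)
    return sum(scores) / max(len(scores), 1)
```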

4.2 Main Results

Table 1 presents the performance results (ANLS and EM) of the different pipeline configurations. For comparison, we also ran the Qwen2.5-VL-3B model directly on the DocVQA dataset to compare its performance with the pipeline. In addition, per-query inference times (model-only and full EaGERS pipeline) were measured to calculate the average inference time for each configuration and to give insight into how inference times compare across documents.

As shown, adding a 15% masking margin consistently improves both EM and ANLS across grid sizes. In particular, the 50-cell grid with a 15% margin yields the best performance, suggesting that finer spatial granularity combined with slight overlap enhances the localization and understanding of regions relevant to the answer. This also suggests that using a fixed distribution grid across the entire image can introduce complications when measuring cosine distance between embeddings. Specifically, relevant zones may fall on the borders between sub-regions, which could explain discrepancies (such as a margin of 0 versus 15) when the grid slightly shifts and either includes or excludes the answer location. This indicates clear future steps in improving the system towards a more flexible grid.

It is also relevant to note that the model “as-is” gains explainability without experiencing any loss in performance; moreover, the best configuration, EaGERS50|15, presents a modest improvement, pointing out that restricting the model’s viewing space so that the answer stands out more clearly is an interesting research direction.

Although this first study does not yet aspire to a head-to-head comparison with state-of-the-art methods, it is worth noting that EaGERS outperforms the solutions initially proposed for DocVQA [12], which reported ANLS results such as LoRRA at 0.110, M4C at 0.385, BERT-based QA systems around 0.655, and multimodal architectures such as LayoutLMv2-BASE with an ANLS of 0.7421. A natural next step is to evaluate the pipeline against current state-of-the-art solutions, which achieve even higher ANLS values.

5 Limitations and Future Work

Several limitations of this study suggest promising directions for future work. Relying on fixed grid configurations may not generalize well to documents with irregular layouts or variable aspect ratios; object detectors may be useful for subdividing relevant regions [3]. Another important limitation is the dependence on spatial explanations generated by VLMs, which may occasionally produce inaccurate rationales and lead to incorrect region selections. Our pipeline assumes the spatial accuracy of VLM explanations; when these do not match the ground truth, the final fidelity may degrade, an aspect we will address in future work.

Future work should include systematic evaluation of the frequency and impact of such inaccuracies on the final results. We will also evaluate more robust fusion strategies, such as Reciprocal Rank Fusion (RRF), and quantify agreement using Krippendorff’s alpha (α), in order to analyze in depth the internal consistency of spatial selections.

It would also be valuable to extend experiments to datasets that explicitly include answer-localization annotations, enabling the use of quantitative fidelity metrics such as IoU or visual-text coherence scores. Moreover, efficiency improvements are needed, considering the increase in inference time reported in Table 1. Finally, we plan to explore methods to properly assess the level of explainability in comparison with alternatives.

6 Conclusions

We have proposed a pipeline that unifies natural language explanations with quantitative region selection and masked re-querying to ensure answers derive only from validated document regions. Our approach yields significant improvements in Exact Match and ANLS over standard baselines, demonstrating enhanced transparency and reproducibility. In future work, we will investigate adaptive grid partitioning to better handle structural variability, conduct comprehensive ablation studies to optimize embedder selection, and directly compare performance with more advanced state-of-the-art models. We also plan to integrate quantitative explainability metrics and carry out user studies to assess the clarity and reliability of the generated explanations.

References

1. Appalaraju, S., Jasani, B., Kota, B.U., Xie, Y., Manmatha, R.: DocFormer: End-to-end transformer for document understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 993-1003 (2021)
2. Biten, A., Tito, R., Mafla, A., Gomez, L., Rusinol, M., Jawahar, C., Valveny, E., Karatzas, D.: Scene text visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4290-4300. IEEE (Oct 2019). https://doi.org/10.1109/ICCV.2019.00439
3. Gomez, L., Biten, A.F., Tito, R., Mafla, A., Rusinol, M., Valveny, E., Karatzas, D.: Multimodal grid features and cell pointers for scene text visual question answering. Pattern Recognition Letters 150, 242-249 (2021). https://doi.org/10.1016/j.patrec.2021.06.026
4. Hu, A., Xu, H., Zhang, L., Ye, J., Yan, M., Zhang, J., Jin, Q., Huang, F., Zhou, J.: mPLUG-DocOwl2: High-resolution compressing for OCR-free multi-page document understanding (2024). https://arxiv.org/abs/2409.03420
5. Hu, R., Singh, A., Darrell, T., Rohrbach, M.: Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9989-9999. IEEE Computer Society, Los Alamitos, CA, USA (Jun 2020). https://doi.org/10.1109/CVPR42600.2020.01001
6. Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 4904-4916. PMLR (2021). https://proceedings.mlr.press/v139/jia21b.html
7. Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: OCR-free document understanding transformer. In: European Conference on Computer Vision. pp. 498-517. Springer (2022)
8. Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: Proceedings of the 40th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 202, pp. 19730-19742. PMLR (2023). https://proceedings.mlr.press/v202/li23q.html
9. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Proceedings of the 39th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 162, pp. 12888-12900. PMLR (2022). https://proceedings.mlr.press/v162/li22n.html
10. Li, K., Vosselman, G., Yang, M.Y.: Convincing rationales for visual question answering reasoning (2025). https://arxiv.org/abs/2402.03896
11. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems. vol. 36, pp. 34892-34916. Curran Associates, Inc. (2023). https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32c
12. Mathew, M., Karatzas, D., Jawahar, C.: DocVQA: A dataset for VQA on document images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2200-2209 (2021)
13. Mohammadshirazi, A., Neogi, P.P.G., Lim, S.N., Ramnath, R.: DLaVA: Document language and vision assistant for answer localization with enhanced interpretability and trustworthiness (2024). https://arxiv.org/abs/2412.00151
14. Powalski, R., Borchmann, L., Jurkiewicz, D., Dwojak, T., Pietruszka, M., Palka, G.: Going full-TILT boogie on document understanding with text-image-layout transformer. In: Document Analysis and Recognition - ICDAR 2021: 16th International Conference. pp. 732-747. Springer (2021)
15. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 8748-8763. PMLR (2021). https://proceedings.mlr.press/v139/radford21a.html
16. Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: A metric and a loss for bounding box regression. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA (Jun 2019). https://doi.org/10.1109/CVPR.2019.00075
17. Saifullah, S., Agne, S., Dengel, A., Ahmed, S.: DocXplain: A novel model-agnostic explainability method for document image classification. In: International Conference on Document Analysis and Recognition (ICDAR) (2024)
18. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 618-626 (2017). https://doi.org/10.1109/ICCV.2017.74
19. Singh, A., Hu, R., Goswami, V., Couairon, G., Galuba, W., Rohrbach, M., Kiela, D.: FLAVA: A foundational language and vision alignment model. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 15617-15629 (2022). https://doi.org/10.1109/CVPR52688.2022.01519
20. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: Pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 1192-1200. KDD '20, ACM (Aug 2020). https://doi.org/10.1145/3394486.3403172
21. Zhou, Y., Muresanu, A.I., Han, Z., Paster, K., Pitis, S., Chan, H., Ba, J.: Large language models are human-level prompt engineers. In: The Eleventh International Conference on Learning Representations (2022)

