1 comment

  • LoMoGan a day ago

    With the rise of vision-language models (VLMs) such as Qwen-VL and GPT-4.1, new end-to-end OCR models like DeepSeek-OCR have emerged. These models jointly understand visual and textual information, enabling direct interpretation of PDFs without an explicit layout detection step.

    However, this paradigm shift raises an important question:

    If a VLM can already process both the document images and the query to produce an answer directly, do we still need the intermediate OCR step?
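    To make the contrast concrete, here is a minimal sketch of the "no intermediate OCR" path: the rendered PDF page and the question go to the VLM in a single call, with no layout detection or text extraction on the client side. It assumes an OpenAI-compatible chat endpoint (for example, a local vLLM server hosting a Qwen-VL checkpoint); the base URL, model id, and file names are placeholders, not a specific vendor's setup.

        # Sketch: answer a question about a PDF page image directly with a VLM,
        # skipping the explicit OCR / layout-detection step.
        # Assumes an OpenAI-compatible server; model id and URL are placeholders.
        import base64
        from openai import OpenAI

        client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

        def ask_page(image_path: str, question: str) -> str:
            # Encode the rendered page as a data URL; the model sees pixels,
            # not extracted text.
            with open(image_path, "rb") as f:
                b64 = base64.b64encode(f.read()).decode()

            response = client.chat.completions.create(
                model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder model id
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "image_url",
                         "image_url": {"url": f"data:image/png;base64,{b64}"}},
                        {"type": "text", "text": question},
                    ],
                }],
            )
            return response.choices[0].message.content

        print(ask_page("invoice_page1.png", "What is the total amount due?"))

    The traditional pipeline would instead run layout detection and OCR first, then pass only the recovered text to a language model; the question is whether that intermediate representation still adds value.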