tesseract – 咖啡偶-IT日常

Installing the Apache Tika document recognition system with Docker and integrating it with n8n

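These days, mainstream document recognition is probably best handed to AI models with vision capabilities, which give high recognition accuracy. If you would rather not pay for AI, Apache Tika can serve as a stopgap.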

Installing Apache Tika with Docker

docker run -d -p 9998:9998 --name tika-server-ocr apache/tika:latest-full
# latest-full is currently 3.1.0
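
The server takes a moment to start after `docker run` returns, so it can help to wait for it before sending files. The following is a minimal sketch, assuming the port mapping above; it polls Tika server's `/version` endpoint. The helper name `wait_for_tika`, its arguments, and the retry count are illustrative choices, not part of Tika.

```shell
# Poll Tika's /version endpoint until it answers or the attempts run out.
# $1: base URL (defaults to the address used in this post)
# $2: number of attempts, one second apart (default 30, an arbitrary choice)
wait_for_tika() {
  _wft_url="${1:-http://127.0.0.1:9998}"
  _wft_tries="${2:-30}"
  _wft_i=0
  while [ "$_wft_i" -lt "$_wft_tries" ]; do
    # -f makes curl return non-zero on HTTP errors, -sS keeps it quiet
    if curl -fsS "$_wft_url/version" >/dev/null 2>&1; then
      return 0
    fi
    _wft_i=$((_wft_i + 1))
    sleep 1
  done
  return 1
}

# Usage:
#   wait_for_tika && echo "Tika is up"
```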

Once that is done, enter the container and install the Chinese language packs.

docker exec -u root -it tika-server-ocr bash
###
apt-get update
apt-get install -y tesseract-ocr-chi-sim tesseract-ocr-chi-tra
###
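
To confirm the language packs were picked up, the output of `tesseract --list-langs` can be checked inside the container. The sketch below assumes the `tesseract` binary is on the PATH in the container; `missing_ocr_langs` is a hypothetical helper name that prints any required language code not found in the listing.

```shell
# Print every required language code that is missing from a
# `tesseract --list-langs` listing read on stdin.
missing_ocr_langs() {
  _mol_installed="$(cat)"
  for _mol_lang in "$@"; do
    # -x matches the whole line, -F treats the code as a fixed string
    if ! printf '%s\n' "$_mol_installed" | grep -qxF "$_mol_lang"; then
      printf '%s\n' "$_mol_lang"
    fi
  done
}

# Usage inside the container (2>&1 because some Tesseract versions
# write the listing to stderr):
#   tesseract --list-langs 2>&1 | missing_ocr_langs eng chi_tra chi_sim
```

No output means all of the requested languages are available.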

Testing

# Linux
# test.png is an image containing Chinese text
curl -T test.png http://127.0.0.1:9998/tika --header "X-Tika-OCRLanguage: eng+chi_tra+chi_sim"
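
The `X-Tika-OCRLanguage` value is a list of Tesseract language codes joined with `+`. As a small sketch, the command above can be wrapped so the language list is passed as arguments. The function names and the `TIKA_URL` variable are assumptions made for illustration, with the address from this post as the default.

```shell
# Join language codes with "+" to form the X-Tika-OCRLanguage value.
ocr_lang_value() {
  _olv_out=""
  for _olv_lang in "$@"; do
    # Add a "+" separator only when something has already been collected
    _olv_out="${_olv_out:+$_olv_out+}$_olv_lang"
  done
  printf '%s\n' "$_olv_out"
}

# Upload a file to Tika with the given OCR languages.
# $1: file to upload; remaining arguments: Tesseract language codes
tika_ocr() {
  _to_file="$1"
  shift
  curl -fsS -T "$_to_file" "${TIKA_URL:-http://127.0.0.1:9998}/tika" \
    --header "X-Tika-OCRLanguage: $(ocr_lang_value "$@")"
}

# Usage:
#   tika_ocr test.png eng chi_tra chi_sim
```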

n8n (nodemation) configuration

The previous node must have the file ready. Then add the following node to hand the file to Tika, with the response format set to text.
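
For reference when filling in the node, here is a rough sketch of the request it needs to reproduce, assuming the node is n8n's HTTP Request node (the post does not name it): a PUT to `/tika` whose body is the raw file from the previous node, plus the OCR language header. The `Accept: text/plain` header is an addition here that asks Tika explicitly for plain text; `tika_put_stdin` and `TIKA_URL` are hypothetical names.

```shell
# Send whatever arrives on stdin to Tika as the raw request body,
# the way a workflow node forwards binary data from a previous step.
# $1: OCR language value (defaults to the one used in this post)
tika_put_stdin() {
  _tps_langs="${1:-eng+chi_tra+chi_sim}"
  # "-T -" makes curl upload stdin with a PUT request
  curl -fsS -T - "${TIKA_URL:-http://127.0.0.1:9998}/tika" \
    --header "Accept: text/plain" \
    --header "X-Tika-OCRLanguage: $_tps_langs"
}

# Usage:
#   tika_put_stdin < test.png
```

If n8n itself runs in a container, `127.0.0.1` will likely need to be replaced with an address that reaches the Tika container.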