[Question] Using pytesseract for OCR

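Original poster: PHONm (USA~USA)   2016-08-08 17:32:48
I'm trying to do character recognition, but some of the characters in the image are broken up, and some of them come out as garbage.
So I preprocess the image with OpenCV first, save it as a new file, and then run OCR again on that file.
That raises a UnicodeDecodeError, even though my code doesn't use any Chinese at all @@!
I'm not sure whether the problem is in the OpenCV file conversion step.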
=====================Result=====================
<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=397x112 at 0x3C0DF28>
24-D 1813f-ml 1-1
154?Dbb
<PIL.BmpImagePlugin.BmpImageFile image mode=RGB size=397x112 at 0x1131080>
Traceback (most recent call last):
  File "C:/Users/cash.chien/PycharmProjects/OCR/OCRv1.1.py", line 19, in <module>
    str2 = image_to_string(img2)
  File "C:\Anaconda3\lib\site-packages\pytesseract\pytesseract.py", line 167, in image_to_string
    return f.read().strip()
UnicodeDecodeError: 'cp950' codec can't decode byte 0xe2 in position 11: illegal multibyte sequence
========================Source code below=======================
from pytesseract import image_to_string
from PIL import Image
import time
import cv2
import numpy as np
img = Image.open('12.bmp')
print(img)
str = image_to_string(img)
print(str)
img1 = cv2.imread('12.bmp',1)
kernel = np.ones((3,3))
opening = cv2.morphologyEx(img1, cv2.MORPH_OPEN, kernel)
cv2.imwrite('opening.bmp',opening)
img2 = Image.open('opening.bmp')
print(img2)
str2 = image_to_string(img2)
print(str2)
Thanks!
Author: Sunal (SSSSSSSSSSSSSSSSSSSSSSS)   2016-08-09 02:16:00
It's probably a cmd output problem. Try switching it to UTF-8.
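A minimal sketch of what the UTF-8 suggestion could look like at the decoding step. The traceback shows the failure inside pytesseract's f.read(), where the 'cp950' codec rejects byte 0xe2. This sketch assumes the OCR output bytes are really UTF-8. The sample bytes and the helper name decode_ocr_output are made up for illustration and are not part of pytesseract.

```python
# Hypothetical sample: UTF-8 bytes containing a curly quote (U+2018).
# Its first UTF-8 byte is 0xE2, the same byte named in the traceback.
raw = "24-D \u20181813".encode("utf-8")


def decode_ocr_output(data, encoding="utf-8"):
    """Decode OCR output bytes with an explicit codec (hypothetical helper)."""
    return data.decode(encoding)


# Decoding with the Traditional Chinese Windows codec fails on 0xE2 0x80,
# because 0x80 is not a valid trail byte in cp950.
try:
    decode_ocr_output(raw, "cp950")
    cp950_failed = False
except UnicodeDecodeError as exc:
    cp950_failed = True
    print("cp950:", exc)

# Decoding the same bytes as UTF-8 succeeds.
text = decode_ocr_output(raw)
print("utf-8:", repr(text))
```

If this reading is right, the fix belongs where the bytes are decoded rather than where they are printed, for example by opening a text file with open(path, encoding="utf-8") instead of relying on the locale default.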
Author: goldflower (金色小黄花)   2016-08-11 00:22:00
Convert the encoding of str first. Also, naming a variable str and overwriting the built-in isn't a great idea XD
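A rough sketch of both points in this reply, under stated assumptions. The helper names shadowing_breaks_builtin and safe_for_console are hypothetical, and re-encoding with errors="replace" is only one possible reading of "convert the encoding". It swaps unprintable characters for "?" instead of raising.

```python
def shadowing_breaks_builtin():
    """Show why a variable named str is risky: the built-in becomes unreachable."""
    str = "24-D 1813"  # shadows the built-in str inside this function
    try:
        str(42)  # now calls a string object, not the built-in type
    except TypeError as exc:
        return type(exc).__name__
    return None


def safe_for_console(text, encoding):
    """Re-encode text for a console codec, replacing characters it cannot represent."""
    return text.encode(encoding, errors="replace").decode(encoding)


error_name = shadowing_breaks_builtin()
print(error_name)

# Use a neutral name such as ocr_text instead of str.
ocr_text = "24-D \u20181813"
printable = safe_for_console(ocr_text, "ascii")
print(printable)
```

In a real script the encoding argument would more likely be the console's own codec (for example sys.stdout.encoding) than "ascii", which is used here only to make the replacement visible.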
