I got the error 'UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position 0: invalid start byte' while running my Python script in a PythonAnywhere console

Hi!

I have written a Twitter bot that generates text in Farsi. It runs fine locally, but when I try to run it here it gives this error in the console:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position 0: invalid start byte

I even commented out every line that prints something, but the error remains.

Here is my code. Thanks in advance.

# Copyright (c) 2015–2016 Molly White
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

try:
    # Fix UTF8 output issues on Windows console.
    # Does nothing if package is not installed
    from win_unicode_console import enable
    enable()
except ImportError:
    pass

import os
import codecs, sys
import tweepy
from secrets import *
from time import gmtime, strftime
import requests
import json
import time

auth = tweepy.OAuthHandler(C_KEY, C_SECRET)
auth.set_access_token(A_TOKEN, A_TOKEN_SECRET)
api = tweepy.API(auth)


# ====== Individual bot configuration ==========================
bot_username = ''
logfile_name = bot_username + ".log"

# ==============================================================
import random
from komaki import afaal
from komaki import mafool
from komaki import motamem
from komaki import afaalGozara
from komaki import shenase
from komaki import linka

f=open("bank.txt","r")
bank = []
akhbar=[]
for line in f:
    bank.append(line)
f.close()
for i in bank:
    i=i.replace('\u06cc','ي')
    i=i.replace('\u06f0','0')
    i=i.replace('\u06f1','1')
    i=i.replace('\u06f2','2')
    i=i.replace('\u06f3','3')
    i=i.replace('\u06f4','4')
    i=i.replace('\u06f5','5')
    i=i.replace('\u06f6','6')
    i=i.replace('\u06f7','7')
    i=i.replace('\u06f8','8')
    i=i.replace('\u06f9','9')


secure_random = random.SystemRandom()

def yeSaz(text):
    if text[-1]=="ه" or text[-1]=="ی":
        if text[-2]=="ی" or text[-2]=="ا" or text[-2]=="و":
            return text+"ی"
        else:
            return text+"\u200Cای"
    elif text[-1]=="و":
        return text+"یی"
    return text+"ی"

def isFel(a,c):
    if c.find(a)<0:
        return False
    x=0
    for i in shenase:
        if(c==a+i):
            x+=1
    if x==0:
        return False
    return True

def kootah(text):
    s=""
    for i in text:
        if i=="؟" or i=="|" or i==":" or i=="/":
            s=""
        elif i!=" " or (i==" " and len(s)>0):
            s+=i
    return s

def farsiSaz(text):
    text=text.replace('\u06cc','ي')
    text=text.replace('\u06f0','0')
    text=text.replace('\u06f1','1')
    text=text.replace('\u06f2','2')
    text=text.replace('\u06f3','3')
    text=text.replace('\u06f4','4')
    text=text.replace('\u06f5','5')
    text=text.replace('\u06f6','6')
    text=text.replace('\u06f7','7')
    text=text.replace('\u06f8','8')
    text=text.replace('\u06f9','9')
    return text

def notTekrari(title):
    #print(farsiSaz(kootah(title)))
    for i in bank:
        if (farsiSaz(kootah(title))+"\n")==farsiSaz(kootah(i)) or farsiSaz(kootah(title))==farsiSaz(kootah(i)) or ("\n"+farsiSaz(kootah(title)))==farsiSaz(kootah(i)) or ("\n"+farsiSaz(kootah(title))+"\n")==farsiSaz(kootah(i)):
            return False
        if farsiSaz(kootah(i))+"!"==farsiSaz(kootah(title)) or farsiSaz(kootah(title))+"!"==farsiSaz(kootah(i)):
            return False
    for i in akhbar:
        if farsiSaz(kootah(title))==farsiSaz(kootah(i)):
            return False
    return True

def fel(title):
    c=""
    for i in title[::-1]:
        if i==' ':
            break
        if i!="!":
            c=i+c
    if c.find(" می ")>=0 or title.find(" که ")>=0 or title.find(" این")>=0:
        return False
    #if title.find("؟")>=0:
        #return False
    for i in afaal:
        if isFel(i,c):
            return True
    return False

def makeNews(link):
    r = requests.get(link)
    data = json.loads(r.text)
    mojaz=[]
    c=0
    for i in data["items"]:
        if fel(data["items"][c]["title"]) and notTekrari(data["items"][c]["title"]):
            mojaz.append(c)
            akhbar.append(data["items"][c]["title"])
        c+=1
    for i in mojaz:
        print(i)
    for i in akhbar:
        i=kootah(i)
        i=farsiSaz(i)

def create_tweet(news):
    """Create the text of the tweet you want to send."""
    # Replace this with your code!
    if len(news)%2==0:
        text = news+"\nو تو هنوز نیامده"+"\u200C"+"ای"
    else:
        text = news+"\n"+yeSaz(secure_random.choice(mafool))+" را در "+(secure_random.choice(motamem))+" "+(secure_random.choice(afaalGozara))+"م"
    return text


def tweet(text):
    """Send out the text as a tweet."""
    # Twitter authentication
    auth = tweepy.OAuthHandler(C_KEY, C_SECRET)
    auth.set_access_token(A_TOKEN, A_TOKEN_SECRET)
    api = tweepy.API(auth)

    # Send the tweet and log success or failure
    try:
        api.update_status(text)
    except tweepy.error.TweepError as e:
        # Python 3 exceptions have no .message attribute, so log the string form
        log(str(e))
    else:
        log("Tweeted: " + text)

def log(message):
    """Log message to logfile."""
    path = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
    with open(os.path.join(path, logfile_name), 'a+') as f:
        t = strftime("%d %b %Y %H:%M:%S", gmtime())
        f.write("\n" + t + " " + message)

if __name__ == "__main__":
    f=open("bank.txt","a")
    f.write("\n\n")
    for sch in range(1,5):
        akhbar.clear()
        for link in linka:
            makeNews(link)
        for i in akhbar:
            i=kootah(i)
            if i=="":
                print("INJA")
                continue
            tweet_text = create_tweet(i)
            if i=="":
                print("BUZZ")
            tweet_text=tweet_text.replace('\u06cc','ي')
            tweet_text=tweet_text.replace('\u06f0','0')
            tweet_text=tweet_text.replace('\u06f1','1')
            tweet_text=tweet_text.replace('\u06f2','2')
            tweet_text=tweet_text.replace('\u06f3','3')
            tweet_text=tweet_text.replace('\u06f4','4')
            tweet_text=tweet_text.replace('\u06f5','5')
            tweet_text=tweet_text.replace('\u06f6','6')
            tweet_text=tweet_text.replace('\u06f7','7')
            tweet_text=tweet_text.replace('\u06f8','8')
            tweet_text=tweet_text.replace('\u06f9','9')
            i=i.replace('\u06cc','ي')
            i=i.replace('\u06f0','0')
            i=i.replace('\u06f1','1')
            i=i.replace('\u06f2','2')
            i=i.replace('\u06f3','3')
            i=i.replace('\u06f4','4')
            i=i.replace('\u06f5','5')
            i=i.replace('\u06f6','6')
            i=i.replace('\u06f7','7')
            i=i.replace('\u06f8','8')
            i=i.replace('\u06f9','9')
            if tweet_text.find('\u06cc')>=0:
                print("YES")
            print(tweet_text)
            bank.append(i)
            f.write(i+"\n")
            tweet(tweet_text)
            if i!=kootah(akhbar[-1]):
                time.sleep(60*10)
        if(sch<4):
            print(sch)
            time.sleep(60*60*6)
    f.close()

Can you post the full traceback (error message), please?

Yes here it is:

Traceback (most recent call last):
  File "/home/AmirNsr/tweetBot/bot.py", line 60, in <module>
    for line in f:
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position 0: invalid start byte

Oh, it came from the text file I read data from. I changed its encoding to UTF-8 and the error is gone!
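
For anyone landing here with the same traceback, a minimal sketch of that fix in Python. The source encoding below (cp1256, the Windows Arabic code page) is only an assumption, since the thread never says what bank.txt was originally saved as, and reencode_to_utf8 and read_lines_utf8 are hypothetical helper names. Passing encoding= explicitly also stops open() from depending on each machine's default encoding, which is a likely reason a script can work locally and fail elsewhere.

```python
import os
import tempfile


def reencode_to_utf8(path, source_encoding):
    """Rewrite the file at `path` as UTF-8, assuming it is currently in `source_encoding`."""
    with open(path, "r", encoding=source_encoding) as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_lines_utf8(path):
    """Read all lines, stating the encoding instead of relying on the platform default."""
    with open(path, "r", encoding="utf-8") as f:
        return list(f)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bank.txt")

    # Simulate a file saved in a legacy Windows code page (assumed here: cp1256).
    with open(path, "w", encoding="cp1256") as f:
        f.write("کتاب\n")

    try:
        read_lines_utf8(path)
        failed_before = False
    except UnicodeDecodeError:
        # Same class of error as in the traceback above.
        failed_before = True

    # Convert the file once, then read it as UTF-8.
    reencode_to_utf8(path, "cp1256")
    lines_after = read_lines_utf8(path)

print(failed_before, lines_after)
```

If the original encoding is unknown, a text editor's "save as UTF-8" option does the same job as reencode_to_utf8.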

But there's still an issue: the APIs I request seem to be blocked here:

Traceback (most recent call last):
  File "/home/AmirNsr/tweetBot/bot.py", line 206, in <module>
    makeNews(link)
  File "/home/AmirNsr/tweetBot/bot.py", line 152, in makeNews
    r = requests.get(link)
  File "/usr/local/lib/python3.6/dist-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 485, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='api.rss2json.com', port=443): Max retries exceeded with url: /v1/api.json?rss_url=http%3A%2F%2Fwww.khabaronline.ir%2FRSS (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 403 Forbidden',)))

Do you know what may cause this?

http://help.pythonanywhere.com/pages/403ForbiddenError

Thank you, sir!

You're welcome!

Hi Amir and all, I'm getting a similar error after using code like this:

df = pd.DataFrame(list(collection.find()))
my_j = df.to_json(orient='records')
return my_j

to get some data from MongoDB into my pandas DataFrame. I then try to convert it to JSON and return that JSON to the browser. I believe there are "funny" characters in my data. Any tips on how I should try to convert it to UTF-8?

Which Python version are you using to run your code? Could you copy/paste the full stack trace?

Here's my error (testing locally against a MongoDB instance on localhost). When I convert the df with to_json(orient='records') I get an error, I believe due to "funny" characters in my df. Hope someone can spot something obvious. I guess I need to use UTF-8.

Traceback (most recent call last):
  File "/home/py3/lib/python3.4/site-packages/flask/app.py", line 1997, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/py3/lib/python3.4/site-packages/flask/app.py", line 1985, in wsgi_app
    response = self.handle_exception(e)
  File "/home/py3/lib/python3.4/site-packages/flask/app.py", line 1540, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/home/py3/lib/python3.4/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/home/py3/lib/python3.4/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/py3/lib/python3.4/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/py3/lib/python3.4/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/py3/lib/python3.4/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/home/py3/lib/python3.4/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/py3/lib/python3.4/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/test/app.py", line 119, in get_df
    my_j =  df.to_json(orient='records') # does not work due to some utf-8 issue!
  File "/home/py3/lib/python3.4/site-packages/pandas/core/generic.py", line 1245, in to_json
    lines=lines)
  File "/home/py3/lib/python3.4/site-packages/pandas/io/json/json.py", line 46, in to_json
    date_unit=date_unit, default_handler=default_handler).write()
  File "/home/py3/lib/python3.4/site-packages/pandas/io/json/json.py", line 90, in write
    default_handler=self.default_handler
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 147: invalid start byte

Edit: cannot be a funny character issue, I think.

I tried the simplest possible csv upload. It works if I use a pickle file. Switch to mongodb and it fails with this UnicodeDecodeError.

Hmm. It's complaining about a byte in your data with value 0xa9, which is definitely non-ASCII, and it's saying that it's not utf-8 either (presumably because in utf-8, if you had that byte, it would need another byte before it to tell the decoder how to decode it). When you say the "simplest possible" CSV, are you sure it only contains ASCII characters? Is the stack trace the same when you use it?
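
To make that concrete, here is a small sketch of the UTF-8 byte ranges, using only the standard library. The helper name utf8_byte_role is made up for illustration. Bytes of the form 10xxxxxx (0x80 to 0xBF) can only continue a character, never start one, which is why both 0x98 and 0xa9 produce "invalid start byte" when they appear where a new character should begin.

```python
def utf8_byte_role(byte):
    """Classify a single byte value (0-255) by the role it can play in UTF-8."""
    if byte <= 0x7F:
        return "ascii"                    # 0xxxxxxx: a complete one-byte character
    if byte <= 0xBF:
        return "continuation"             # 10xxxxxx: only valid after a lead byte
    if byte <= 0xC1:
        return "invalid"                  # would only encode overlong forms
    if byte <= 0xDF:
        return "lead of 2-byte sequence"  # 110xxxxx
    if byte <= 0xEF:
        return "lead of 3-byte sequence"  # 1110xxxx
    if byte <= 0xF4:
        return "lead of 4-byte sequence"  # 11110xxx
    return "invalid"                      # beyond the Unicode range


print(utf8_byte_role(0xA9))  # the byte from the pandas traceback
print(utf8_byte_role(0x98))  # the byte from the first traceback in this thread

# In valid UTF-8 the copyright sign is a lead byte followed by 0xa9.
encoded = "©".encode("utf-8")
print(encoded)

# A lone 0xa9 has no lead byte in front of it, so decoding fails.
try:
    b"\xa9".decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason
print(reason)
```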

Hmm, no I'm not 100% sure. I'm going to check. It goes into mongodb with no problem.

Got a little help from someone, and we suspect it has something to do with pymongo. That 0xa9, decoded as latin-1, is a copyright symbol, by the way. We seem to have ruled out an issue with Google Sheets and Flask-Excel. The encode/decode issue happens when putting the df into MongoDB and taking it back out. Hmm.
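
A quick way to check that kind of guess is to decode the same bytes with a few candidate encodings and compare the results. This is only a sketch: the sample bytes and the try_decodings helper are invented for illustration. Keep in mind that latin-1 maps every possible byte to some character, so a successful latin-1 decode never proves the data really is latin-1.

```python
def try_decodings(data, encodings):
    """Return {encoding name: decoded text, or None if decoding fails}."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical bytes containing the 0xa9 reported in the traceback.
sample = b"\xa9 2017"
results = try_decodings(sample, ["utf-8", "latin-1", "cp1252"])
for name, text in results.items():
    print(name, repr(text))

# latin-1 accepts all 256 byte values, so it cannot be used to detect an encoding.
all_bytes_as_latin1 = bytes(range(256)).decode("latin-1")
print(len(all_bytes_as_latin1))
```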

Interesting -- so (tentatively) Mongo is returning data in latin-1 encoding?

I think that idea turned out to be a red herring. As far as I can tell, the issue is that MongoDB returns its _id field as an ObjectId, and pandas to_json() could not convert that to JSON. If, as a quick test, I copy the data to a new dictionary without the _id (not really desirable as a real solution), things work fine. I haven't looked into this further, though. There is probably a canonical, Pythonic way to leave out the _id when loading the data from MongoDB into my DataFrame, but I need to learn a bit more about that.
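
One possible way to do that, sketched with plain dictionaries so it runs without a database. The documents and the stand-in for ObjectId are made up, and drop_id is a hypothetical helper. With pymongo itself, passing a projection such as collection.find({}, {"_id": 0}) should have the same effect on the server side, so the ObjectId never reaches pandas.

```python
import json


class FakeObjectId:
    """Stand-in for bson.ObjectId: an object the json module cannot serialize."""


def drop_id(documents):
    """Return copies of the documents without the "_id" key (originals are untouched)."""
    return [{k: v for k, v in doc.items() if k != "_id"} for doc in documents]


# Hypothetical documents shaped like what a MongoDB query returns.
docs = [
    {"_id": FakeObjectId(), "name": "a", "value": 1},
    {"_id": FakeObjectId(), "name": "b", "value": 2},
]

# With the _id present, serialization fails. The stdlib json module raises
# TypeError here, which is a different error from the pandas one in the thread.
try:
    json.dumps(docs)
    raw_ok = True
except TypeError:
    raw_ok = False

# Without the _id, the remaining fields serialize normally.
my_j = json.dumps(drop_id(docs))
print(raw_ok, my_j)
```

The cleaned list can be handed to pd.DataFrame(...) in the same way as the original query result.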