Forums

Database Advice Needed

Hey,

I've just reached the MVP point in my application. Essentially, users can add their websites to it. Currently I have 18 sites, which I have access to as part of my private beta, but adding them has already accumulated 6+ million rows of data, and each user website will generate up to 5 thousand DB rows per day.

Is MySQL on PA OK to handle that much data? It's essentially 1 table made up of about 8 columns. I'm not really concerned with writes to the database, as this isn't going to affect the user, just as long as it updates their info every 24 hours.

I know this isn't much to go on, and I don't know what info you might need. I've been testing on MongoDB, but I don't want the hassle of rewriting things if it's not needed.

Happy to give more info etc., but ultimately, will having hundreds of millions of rows be a nightmare for me and my users when reading their data?

Just worked it out: each profile will make up roughly 1,825,000 rows per year. Assuming some relative success and I got 100 profiles added, that would be around 182,500,000 DB rows.
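A quick back-of-the-envelope check of those numbers (the 5,000 rows/day figure is the worst case from the first post):

```python
# Back-of-the-envelope growth estimate for the table.
ROWS_PER_PROFILE_PER_DAY = 5_000
DAYS_PER_YEAR = 365

rows_per_profile_per_year = ROWS_PER_PROFILE_PER_DAY * DAYS_PER_YEAR
print(rows_per_profile_per_year)   # 1825000 -> ~1.8 million per profile

profiles = 100
total_rows_per_year = rows_per_profile_per_year * profiles
print(total_rows_per_year)         # 182500000 -> ~182.5 million
```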

Loading my app's pages to show breakdowns of a profile's data takes 30+ seconds at the moment.

At first glance it sounds like it might be pushing the limits of what MySQL can do.

From the scalability section here, it seems that they can do ~5bn rows.

Also, let me discuss this with my other PythonAnywhere coworkers etc. after the weekend and get back to you.

When you say "breakdown of a profile's data", do you mean that you are hitting a view that is doing some sort of query on the database, and it's taking 30+ seconds to return? If so, maybe there are some DB indexing tweaks or Python code optimizations that you could do.

Yeah, so essentially it's a breakdown of the site's landing pages along with statistics on keywords etc. The query looks like:

SELECT landing_page, COUNT(DISTINCT keyword)
FROM search_data
WHERE profile_id={}
GROUP BY landing_page
ORDER BY COUNT(keyword) DESC
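For a query shaped like that, a composite index with the WHERE column first, then the GROUP BY column, then the aggregated column can act as a covering index, so MySQL never touches the base rows. Here's a minimal sketch using SQLite purely for illustration (production is MySQL, but the same `CREATE INDEX` applies there); the index name is made up, and I've ordered by the distinct count rather than the plain count:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE search_data (
        profile_id   INTEGER,
        landing_page TEXT,
        keyword      TEXT
    )
""")
# Covering index: filter column first, then group-by column, then the
# column being aggregated.
conn.execute("""
    CREATE INDEX idx_profile_page_kw
        ON search_data (profile_id, landing_page, keyword)
""")
conn.executemany(
    "INSERT INTO search_data VALUES (?, ?, ?)",
    [(1, "/home", "shoes"), (1, "/home", "boots"),
     (1, "/about", "shoes"), (2, "/home", "shoes")],
)
rows = conn.execute("""
    SELECT landing_page, COUNT(DISTINCT keyword) AS kw_count
      FROM search_data
     WHERE profile_id = ?
     GROUP BY landing_page
     ORDER BY kw_count DESC
""", (1,)).fetchall()
print(rows)  # [('/home', 2), ('/about', 1)]
```

Using a parameter placeholder (`?` here, `%s` in MySQL drivers) instead of `{}` string formatting also avoids SQL injection.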

I did have thoughts about using multiple databases, maybe 1 per user or even per profile, though it doesn't sound like good practice. An application-level join to find the correct DB or table doesn't sound like it would be too expensive in terms of load times, but again, this is where my experience lets me down.

There is, of course, the option for me to build temporary tables around this query (though the query wasn't my original question, I know), which would help with load times etc., but I plan to build date range lookups for my users, so it wouldn't really work :/

Sorry, just a final note: I'm happy for you guys to have a look at it if it helps with your understanding of the question.

Like Conrad says, it sounds like you're going to hit scaling limits with that kind of data.

Is there any possibility for some kind of caching? Like, for example, you could pre-process each month's data into some kind of summary so that you don't have to keep re-querying it? That's essentially what we do for the hit charts on the "Web" tab -- all data over 31 days old gets consolidated down to summary per-month data. Detail gets lost, but it's a reasonable tradeoff for our use case -- maybe not in your case.
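The consolidation idea above could be sketched roughly like this: a periodic job rolls a closed month's detail rows up into a summary table, then deletes the detail. All table and column names here are hypothetical, not from the actual app, and SQLite stands in for MySQL purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE search_data (
        profile_id INTEGER, landing_page TEXT, keyword TEXT, day TEXT
    );
    CREATE TABLE monthly_summary (
        profile_id INTEGER, landing_page TEXT, month TEXT, keyword_count INTEGER
    );
""")
conn.executemany(
    "INSERT INTO search_data VALUES (?, ?, ?, ?)",
    [(1, "/home", "shoes", "2016-01-03"),
     (1, "/home", "boots", "2016-01-15"),
     (1, "/about", "shoes", "2016-01-20")],
)
# Periodic job: aggregate one closed month, then drop the detail rows.
conn.execute("""
    INSERT INTO monthly_summary (profile_id, landing_page, month, keyword_count)
    SELECT profile_id, landing_page, substr(day, 1, 7),
           COUNT(DISTINCT keyword)
      FROM search_data
     WHERE day LIKE '2016-01%'
     GROUP BY profile_id, landing_page, substr(day, 1, 7)
""")
conn.execute("DELETE FROM search_data WHERE day LIKE '2016-01%'")
summary = conn.execute(
    "SELECT * FROM monthly_summary ORDER BY landing_page"
).fetchall()
print(summary)
# [(1, '/about', '2016-01', 1), (1, '/home', '2016-01', 2)]
```

As Giles says, per-month distinct counts lose detail (you can't recover a per-day or cross-month distinct count from them), so whether the tradeoff works depends on the date-range queries you need.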

I'm looking at this: https://aws.amazon.com/rds/aurora/. Is it essentially what I'm already using on PA? I was thinking of using it, as I'd be in the same data centres as the PA servers.

Thanks

Our normal database servers are just RDS MySQL instances rather than Aurora, though we've certainly considered using Aurora as an alternative -- it looks pretty good! We're just a little concerned that there may be compatibility issues if people were expecting MySQL on PythonAnywhere but we were using something that wasn't quite MySQL under the hood.

Obviously that concern doesn't matter if you're thinking of getting your own DB server. If you create an Aurora instance in the us-east-1c "Availability Zone" then you'll be right next to our servers and should get minimal latency.