# Speeding up SQL queries

One of the issues I faced was loading large SQL query results, with millions of rows of state data, into a pandas DataFrame. This had become the bottleneck for some of our users, and the large RAM usage was a deal breaker.

To tackle this, I used memory-profiler to compare three different query approaches:

  1. cursor.fetchall()
  2. pd.read_sql()
  3. pd.read_sql(stream_results=True)

In summary, the best performer was pd.read_sql() with stream_results=True, which cut memory usage roughly in half. The saving comes from not holding two copies of the same data on the client side, which both fetchall() and a plain read_sql() end up doing.

More details on the code for each approach are below, with memory-usage plots to back them up. (Note: the initial ~100 MB spike in each plot comes from other parts of the code and can be ignored.) A rough sketch of how the profiling itself was set up follows.
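For context, the measurements come from the memory-profiler package. The snippet below is only a minimal sketch of that kind of setup, not the actual project code; `load_states()` is a placeholder for one of the query approaches in the sections that follow.

```python
# Minimal memory-profiler sketch (load_states is a placeholder, not real project code).
from memory_profiler import profile

@profile  # records the memory usage of this function line by line
def load_states():
    # one of the query approaches from the sections below would go here
    pass

if __name__ == "__main__":
    load_states()
```

Running the script with `mprof run script.py` and then `mprof plot` produces memory-over-time charts like the ones referenced below.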

## DB initialising

```python
import pandas as pd
from sqlalchemy import create_engine

mydb = create_engine(
    f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}"
)
query = "SELECT * FROM states"
```
## 1. cursor.fetchall()

```python
connection = mydb.raw_connection()
mycursor = connection.cursor()
mycursor.execute(query)
myresult = mycursor.fetchall()  # the entire result set is buffered in memory here

# Clean up into a DataFrame
col_names = [elt[0] for elt in mycursor.description]
df = pd.DataFrame(myresult, columns=col_names)

mycursor.close()
connection.close()
```

Using cursor.fetchall(): ~45 MB

## 2. pd.read_sql()

```python
with mydb.connect() as con:
    list_df = [
        df
        for df in pd.read_sql(
            query,
            con=con,
            index_col="state_id",
            parse_dates=["last_changed", "last_updated"],
            chunksize=1000,
        )
    ]
    df_output = pd.concat(list_df)
```

Using pd.read_sql(): ~45 MB. Chunking alone doesn't help here: without streaming, the driver still buffers the full result set on the client before pandas iterates over it in chunks.

## 3. pd.read_sql(stream_results=True)

```python
with mydb.connect() as con:
    # stream_results asks the driver for an unbuffered (server-side) cursor
    con = con.execution_options(stream_results=True)
    list_df = [
        df
        for df in pd.read_sql(
            query,
            con=con,
            index_col="state_id",
            parse_dates=["last_changed", "last_updated"],
            chunksize=1000,
        )
    ]
    df_output = pd.concat(list_df)
```

Using pd.read_sql(stream_results=True): ~20 MB
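
To make this reusable, the streaming approach above could be wrapped in a small helper. This is just a sketch based on the snippets in this post; the `read_sql_streaming` name and its parameters are my own, not part of the original code.

```python
from typing import Optional, Sequence

import pandas as pd
from sqlalchemy.engine import Engine


def read_sql_streaming(
    engine: Engine,
    query: str,
    index_col: Optional[str] = None,
    parse_dates: Optional[Sequence[str]] = None,
    chunksize: int = 1000,
) -> pd.DataFrame:
    """Load a query result into a DataFrame via a streaming (unbuffered) cursor."""
    with engine.connect() as con:
        con = con.execution_options(stream_results=True)
        chunks = pd.read_sql(
            query,
            con=con,
            index_col=index_col,
            parse_dates=parse_dates,
            chunksize=chunksize,
        )
        # pd.concat accepts the iterator of chunks directly
        return pd.concat(chunks)
```

Usage would then look like `df = read_sql_streaming(mydb, "SELECT * FROM states", index_col="state_id", parse_dates=["last_changed", "last_updated"])`.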