I’m attempting to use SQL to pull data from a database into a Jupyter (python) notebook and work with it thereI have a query that pulls the yearweek of flight’s upload date, and counts the number of flights in that yearweekFinally, it groups the results by the yearweek of upload date:
SELECT YEARWEEK(d.upload_date), COUNT(f.id)
FROM apps_flight f
LEFT JOIN apps_enginedatafile d ON d.id=f.import_file_id
WHERE f.global_duplicate = 0
GROUP BY YEARWEEK(d.upload_date)
I want to count number of subscribers (located in another table) from each yearweek to compare them to count of flightsSo I’m trying to join said table by adding:
LEFT JOIN apps_subscription s ON s.basesubscription_ptr_id = f.id
But, when I do this, the counts of my flight values change! The first few counts for the original query look like:
[327, 605, 78, 5768, 9716, 9686, 7902, 3699, 3323, 6081, 4966, 3456, 3181, 2749, 4577, 3157, 1792, 1806, …]
After joining the table, it becomes:
[327, 738, 78, 8854, 17418, 16156, 13921, 7536, 5380, 10040, 7559, 5461, 6323, 6412, 6702, 5433, 2924, …]
I’m not sure what’s happening herePerhaps the join is creating duplicate rows? The data set is very large, and takes about 30 minutes to run the queryAdding a LIMIT doesn’t seem to speed it up, so as you can imagine, testing takes a little while(If I’m oblivious to another way to speed up the query aside from a LIMIT, feel free to make me aware)!
Thanks for any info.