An introduction to Joblib
The data engineer at my work recently told me about Joblib and I’m hooked! It makes my code run so much faster that there is no going back. Goodbye chunky for loops and hello parallelization!
Joblib can run the iterations of a for loop in parallel to reduce run time. Let's take a simple example:
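# Imports used by all the snippets below
import time
from datetime import datetime
from joblib import Parallel, delayed

# Create a simple function which will be called multiple times in a for loop
def square(x):
    time.sleep(1)  # Adding extra time to ensure we see the difference
    return x * x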
Let's call it in a plain for loop:
start = datetime.now()
for i in range(10):
    print(square(i))
end = datetime.now()
print(f'Time taken: {end-start}')
Output — Time taken: 0:00:10.096081
Let’s use joblib parallelization and use 2 cores:
# Calling the for loop, using 2 cores
start = datetime.now()
def parallel_example(i):
    return square(i)
out = Parallel(n_jobs=2)(delayed(parallel_example)(i) for i in range(10))
end = datetime.now()
print(f'Time taken: {end-start}')
Output — Time taken: 0:00:06.081070
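A quick aside on what `delayed` is doing in that call. The sketch below is my own illustration, not part of the notebook, and it assumes a shorter sleep to keep it quick. It shows that `delayed(square)(i)` only records the call for `Parallel` to run later, that the `parallel_example` wrapper is optional, and that the results come back in input order.

```python
import time
from joblib import Parallel, delayed

def square(x):
    time.sleep(0.1)  # shorter pause than in the article, to keep the demo quick
    return x * x

# delayed(square)(3) does not call square; it only records the call
# as a (function, args, kwargs) tuple for Parallel to run later.
task = delayed(square)(3)

# No wrapper function is needed: pass square to delayed directly.
out = Parallel(n_jobs=2)(delayed(square)(i) for i in range(10))
print(out)  # results are returned in the same order as the inputs
```

So `out[i]` always holds `square(i)`, even though the calls are spread across two workers.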
Let's use joblib with 10 workers, one per call:
# Calling the for loop, using 10 workers (n_jobs=-1 would use all available cores)
start = datetime.now()
def parallel_example(i):
    return square(i)
out = Parallel(n_jobs=10)(delayed(parallel_example)(i) for i in range(10))
end = datetime.now()
print(f'Time taken: {end-start}')
Output — Time taken: 0:00:02.096398
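As the timings show, the calls run in parallel and the total time drops accordingly: the ten one-second calls took about 10 seconds in the plain loop, about 6 seconds with 2 workers, and about 2 seconds with 10 workers. The gap from the ideal 5 seconds and 1 second is most likely the overhead of starting the workers.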
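If you want to repeat the comparison on your own machine, here is a minimal sketch. The `time_run` helper is a hypothetical name I made up for this post, the sleep is shortened to 0.2 seconds, and `n_jobs=-1` asks joblib to use all available cores. Your timings will depend on your machine, so none are shown.

```python
import time
from joblib import Parallel, delayed

def square(x):
    time.sleep(0.2)  # assumption: shorter pause than the 1 second used above
    return x * x

def time_run(n_jobs, n_tasks=10):
    """Hypothetical helper: run square over range(n_tasks) and
    return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = Parallel(n_jobs=n_jobs)(delayed(square)(i) for i in range(n_tasks))
    return results, time.perf_counter() - start

# Compare one worker, two workers, and all available cores.
for n_jobs in (1, 2, -1):
    results, elapsed = time_run(n_jobs)
    print(f"n_jobs={n_jobs}: {elapsed:.2f}s")
```

I used `time.perf_counter()` here instead of `datetime.now()` because it is meant for measuring elapsed time; either works for a rough comparison like this one.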
For proper formatting, please see the notebook on my GitHub: https://github.com/Neelam-Singhal/data_engineering/blob/master/JobLib_intro.ipynb