Image Processing #11 - Speed up image processing with numba
I have written several posts about watermarking:
- Image Processing #4 - WaterMark without alpha channel
- Image Processing #5 - WaterMark using alpha channel
The example Python code in those articles performs RGB operations on all or some of the pixels of an image. For a 2048 x 1152 image, that means at least Width x Height x Channels = 2048 x 1152 x 3 = 7,077,888 calculations per frame. For video at 30 frames per second, it comes to 212,336,640 operations per second.
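The arithmetic above can be checked with a few lines of Python. This is only a rough count that assumes one operation per channel value; the variable names are mine, not from the earlier articles.

```python
# Rough count of per-pixel operations, using the figures from the text.
width, height, channels = 2048, 1152, 3
fps = 30  # frame rate assumed in the text

ops_per_frame = width * height * channels
ops_per_second = ops_per_frame * fps

print(f"per frame : {ops_per_frame:,}")
print(f"per second: {ops_per_second:,}")
```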
Python is easy to learn and comes with modules for almost everything, so you can expect high productivity. However, performance drops sharply as the number of iterations of loop statements such as for and while grows. The reason is that Python is interpreted: the interpreter processes every statement of the loop body again on every iteration instead of running pre-compiled machine code.
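As a small illustration of this overhead, the sketch below times an explicit Python loop against the built-in sum(), which performs the same iteration inside the interpreter's C code. The function names and the smaller N are my own choices for a quick run, and the timings will differ from machine to machine.

```python
import timeit

N = 1_000_000  # smaller than the 10,000,000 used later, to keep the run short

def loop_sum(n):
    """Sum 0..n-1 with an explicit Python-level loop."""
    total = 0
    for x in range(n):
        total += x
    return total

def builtin_sum(n):
    """Same sum, but the iteration happens inside the C-implemented sum()."""
    return sum(range(n))

# Run each version three times and report the total time.
t_loop = timeit.timeit(lambda: loop_sum(N), number=3)
t_builtin = timeit.timeit(lambda: builtin_sum(N), number=3)
print(f"explicit loop: {t_loop:.4f} s")
print(f"built-in sum : {t_builtin:.4f} s")
```

Both functions return the same value; only the place where the loop runs is different.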
Let's compare the performance with the following simple example, written in both C and Python.
It sums the integers from 0 to 9,999,999 (10,000,000 loop iterations).
#include <stdio.h>      // for printf()
#include <sys/time.h>   // for gettimeofday()

void loop_func(void) {
    unsigned long sum = 0;
    for (int x = 0; x < 10000000; x++) {
        sum += x;
    }
    printf("SUM[0 ~ 10000000] is %lu\n", sum);
}

int main(void) {
    struct timeval start, end;
    long secs_used, micros_used;

    gettimeofday(&start, NULL);
    loop_func();
    gettimeofday(&end, NULL);

    secs_used = end.tv_sec - start.tv_sec;  // avoid overflow by subtracting first
    micros_used = (secs_used * 1000000) + end.tv_usec - start.tv_usec;
    double secs_elapsed = micros_used / 1000000.0;
    printf("micros_used: %ld\n", micros_used);
    printf("sec used: %10.6f\n", secs_elapsed);
    return 0;
}
<for.c>
Now compile and run the code.
root@ubuntusrv:/usr/local/src/study/tmp# gcc for.c
root@ubuntusrv:/usr/local/src/study/tmp# ./a.out
SUM[0 ~ 10000000] is 49999995000000
micros_used: 19571
sec used: 0.019571

The compiled C executable takes 0.019571 seconds on my computer to perform the 10,000,000 additions.
This time, I'll test it with Python code that does the same thing.
import time

def loop_func():
    sum = 0
    for x in range(10000000):
        sum += x
    print("SUM[0 ~ 10000000] is %ld" % sum)

start = time.time()
loop_func()
end = time.time()
print("sec used: %10.6f" % (end - start))
<for.py>
Now run the code.
root@ubuntusrv:/usr/local/src/study/tmp# python3 for.py
SUM[0 ~ 10000000] is 49999995000000
sec used: 0.950677
This time it took 0.950677 seconds. A program written in C is about 49 times faster.
This figure obviously depends on what the program does. When there are many simple loop statements, as in the example above, the difference in speed is large. Using a numerical module such as numpy narrows the gap, because numpy is written in C and the work inside a numpy function runs as compiled code.
You can also run partially compiled code from Python. The numba module uses a JIT (Just-in-Time) compiler built on llvmlite to compile functions written in pure Python, or using certain modules such as numpy, to machine code before executing them.
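To illustrate the point about numpy, here is a minimal sketch that computes the same sum with an array operation instead of a Python loop. The function name numpy_sum is mine, and I have not benchmarked it on the machine used for the other timings in this post, so treat it only as an example of the approach.

```python
import time
import numpy as np

def numpy_sum(n):
    """Sum 0..n-1 using a NumPy array; the loop runs inside NumPy's C code."""
    # int64 is requested explicitly so the result cannot overflow on
    # platforms whose default integer is 32 bits wide.
    return int(np.arange(n, dtype=np.int64).sum())

start = time.perf_counter()
total = numpy_sum(10_000_000)
elapsed = time.perf_counter() - start
print(f"SUM[0 ~ 10000000] is {total}")
print(f"sec used: {elapsed:10.6f}")
```

Note that this version trades memory for speed, because it builds the whole array before summing it.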
Install numba and compare performance
First, install numba with the pip3 command.
root@ubuntusrv:/usr/local/src/study/tmp# pip3 install numba
Collecting numba
  Downloading numba-0.53.0-cp38-cp38-manylinux2014_x86_64.whl (3.4 MB)
     |████████████████████████████████| 3.4 MB 2.0 MB/s
Collecting llvmlite<0.37,>=0.36.0rc1
  Downloading llvmlite-0.36.0-cp38-cp38-manylinux2010_x86_64.whl (25.3 MB)
     |████████████████████████████████| 25.3 MB 383 kB/s
Collecting numpy>=1.15
  Downloading numpy-1.20.1-cp38-cp38-manylinux2010_x86_64.whl (15.4 MB)
     |████████████████████████████████| 15.4 MB 7.3 MB/s
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from numba) (45.2.0)
Installing collected packages: llvmlite, numpy, numba
Successfully installed llvmlite-0.36.0 numba-0.53.0 numpy-1.20.1
Now, let's compile the simple iteration function loop_func of the for.py code above using numba and run it.
numba has many features, but let's start with the easiest one.
Import jit from the numba module and add the decorator to the function you want to JIT-compile. The syntax of this decorator is described in detail in the Numba documentation.
import time
from numba import jit

@jit(nopython=True, cache=True)
def loop_func():
    sum = 0
    for x in range(10000000):
        sum += x
    print("SUM[0 ~ 10000000] is " + str(sum))

start = time.time()
loop_func()
end = time.time()
print("sec used: %10.6f" % (end - start))
<for_numba.py>
When using numba, the %-formatted print raised an error, so I replaced this line:
print("SUM[0 ~ 10000000] is %ld" % sum)
with string concatenation:
print("SUM[0 ~ 10000000] is " + str(sum))
If you run the code, the first run shows no improvement (it is even slightly slower), but from the second run on it is about 20 times faster than the pure Python version (0.046993 versus 0.950677 seconds). The reason is that the JIT compiler runs during the first execution and creates compiled machine code, which is reused from the second run on. Reuse across separate runs is possible because we set cache=True in the decorator, which saves the compiled code to disk.
If cache is set to False, the function is compiled again in every new process, but the compiled code is still kept in memory, so performance improves whenever the JIT-compiled function is called repeatedly within one execution.
root@ubuntusrv:/usr/local/src/study/tmp# python3 for_numba.py
SUM[0 ~ 10000000] is 49999995000000
sec used: 1.026282
root@ubuntusrv:/usr/local/src/study/tmp# python3 for_numba.py
SUM[0 ~ 10000000] is 49999995000000
sec used: 0.046993

Now the gap from the C program is only about 2.4 times.
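The in-process behaviour without the disk cache can be observed by timing two calls in the same run. The sketch below is my own variation of for_numba.py: it returns the sum instead of printing it inside the compiled function, leaves cache at its default, and includes a do-nothing fallback decorator (not part of numba) so that it still runs, uncompiled, where numba is not installed.

```python
import time

try:
    from numba import jit
except ImportError:
    # Fallback (not part of numba): a do-nothing decorator so the sketch
    # still runs, uncompiled, where numba is not installed.
    def jit(*args, **kwargs):
        return lambda func: func

@jit(nopython=True)  # cache is left at its default (False)
def loop_sum(n):
    total = 0
    for x in range(n):
        total += x
    return total

def timed_call(n):
    """Return (result, seconds) for one call of loop_sum."""
    start = time.perf_counter()
    result = loop_sum(n)
    return result, time.perf_counter() - start

first_result, first_secs = timed_call(10_000_000)    # includes compilation
second_result, second_secs = timed_call(10_000_000)  # reuses compiled code
print(f"first call : {first_secs:10.6f} s")
print(f"second call: {second_secs:10.6f} s")
```

With numba installed, the first call should include the compilation time and the second call should not. Returning the value and formatting it in ordinary Python also sidesteps the print formatting limitation mentioned earlier.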
Improving pixel operation speed using Numba
Let's compare the performance using the example from Image Processing #5 - WaterMark using alpha channel.
I have added print statements to the example code, including one that prints the time spent on watermarking.
import argparse
import cv2
import numpy as np
import time

'''
Overlay a mask image that has an alpha channel onto the base image.
pos is the (x, y) position of the mask's top-left corner.
'''
def process_alpha_masking(base, mask, pos):
    h, w, c = mask.shape
    hb, wb, _ = base.shape
    x = pos[0]
    y = pos[1]

    # check mask position
    if(x > wb or y > hb):
        print(' invalid overlay position (' + str(x) + ',' + str(y) + ')')
        return None

    # the mask must have an alpha channel
    if c != 4:
        print('mask image file does not have alpha channel')
        return None

    # adjust mask so that it does not extend past the base image
    if(x + w > wb):
        mask = mask[:, 0:wb - x]
        print(' mask X size adjust W:' + str(w) + ' -> W:' + str(wb - x))
    if(y + h > hb):
        mask = mask[0:hb - y, :]
        print(' mask Y size adjust H:' + str(h) + ' -> H:' + str(hb - y))

    h, w, c = mask.shape
    img = base.copy()
    bg = img[y:y+h, x:x+w]      # overlay area
    for i in range(0, h):
        for j in range(0, w):
            B = mask[i][j][0]
            G = mask[i][j][1]
            R = mask[i][j][2]
            alpha = mask[i][j][3] * 1.0 / 255.0
            if (alpha > 0.0):
                bg[i][j][0] = int(B * alpha + bg[i][j][0] * (1 - alpha))
                bg[i][j][1] = int(G * alpha + bg[i][j][1] * (1 - alpha))
                bg[i][j][2] = int(R * alpha + bg[i][j][2] * (1 - alpha))

    img[y:y+h, x:x+w] = bg
    return img

parser = argparse.ArgumentParser(description="OpenCV Example")
parser.add_argument("--file", type=str, required=True, help="filename of the input image to process")
parser.add_argument("--mask", type=str, required=True, help="mask image to overlay")
args = parser.parse_args()

img = cv2.imread(args.file, cv2.IMREAD_COLOR)
height, width, channels = img.shape
print("image H:%d W:%d, Channel:%d"%(height, width, channels))
cv2.imwrite('/tmp/original.jpg', img)

mark = cv2.imread(args.mask, cv2.IMREAD_UNCHANGED)
mheight, mwidth, mchannels = mark.shape
print("mask H:%d W:%d, Channel:%d"%(mheight, mwidth, mchannels))

x = np.amin( [mwidth, width])
y = np.amin( [mheight, height])

start = time.time()
# repeat the watermarking 10 times; note that the loop variable reuses x,
# so the overlay x position moves from 0 to 9
for x in range(10):
    new_img = process_alpha_masking(img, mark, (x, y))
end = time.time()
print("sec used: %10.6f"%(end - start))

if new_img is not None :
    cv2.imwrite('/tmp/masked.jpg', new_img)
<watermark_alpha.py>
And the following is the code using numba. Apart from the import and the decorator, nothing has changed.

import argparse
import cv2
import numpy as np
import time
from numba import jit

'''
Overlay a mask image that has an alpha channel onto the base image.
pos is the (x, y) position of the mask's top-left corner.
'''
@jit(nopython=True, cache=True)
def process_alpha_masking(base, mask, pos):
    h, w, c = mask.shape
    hb, wb, _ = base.shape
    x = pos[0]
    y = pos[1]

    # check mask position
    if(x > wb or y > hb):
        print(' invalid overlay position (' + str(x) + ',' + str(y) + ')')
        return None

    # the mask must have an alpha channel
    if c != 4:
        print('mask image file does not have alpha channel')
        return None

    # adjust mask so that it does not extend past the base image
    if(x + w > wb):
        mask = mask[:, 0:wb - x]
        print(' mask X size adjust W:' + str(w) + ' -> W:' + str(wb - x))
    if(y + h > hb):
        mask = mask[0:hb - y, :]
        print(' mask Y size adjust H:' + str(h) + ' -> H:' + str(hb - y))

    h, w, c = mask.shape
    img = base.copy()
    bg = img[y:y+h, x:x+w]      # overlay area
    for i in range(0, h):
        for j in range(0, w):
            B = mask[i][j][0]
            G = mask[i][j][1]
            R = mask[i][j][2]
            alpha = mask[i][j][3] * 1.0 / 255.0
            if (alpha > 0.0):
                bg[i][j][0] = int(B * alpha + bg[i][j][0] * (1 - alpha))
                bg[i][j][1] = int(G * alpha + bg[i][j][1] * (1 - alpha))
                bg[i][j][2] = int(R * alpha + bg[i][j][2] * (1 - alpha))

    img[y:y+h, x:x+w] = bg
    return img

parser = argparse.ArgumentParser(description="OpenCV Example")
parser.add_argument("--file", type=str, required=True, help="filename of the input image to process")
parser.add_argument("--mask", type=str, required=True, help="mask image to overlay")
args = parser.parse_args()

img = cv2.imread(args.file, cv2.IMREAD_COLOR)
height, width, channels = img.shape
print("image H:%d W:%d, Channel:%d"%(height, width, channels))
cv2.imwrite('/tmp/original.jpg', img)

mark = cv2.imread(args.mask, cv2.IMREAD_UNCHANGED)
mheight, mwidth, mchannels = mark.shape
print("mask H:%d W:%d, Channel:%d"%(mheight, mwidth, mchannels))

x = np.amin( [mwidth, width])
y = np.amin( [mheight, height])

start = time.time()
# repeat the watermarking 10 times; note that the loop variable reuses x,
# so the overlay x position moves from 0 to 9
for x in range(10):
    new_img = process_alpha_masking(img, mark, (x, y))
end = time.time()
print("sec used: %10.6f"%(end - start))

if new_img is not None :
    cv2.imwrite('/tmp/masked.jpg', new_img)
<watermark_alpha_numba.py>
Now let's compare the performance of the two codes.
root@ubuntusrv:/usr/local/src/study/tmp# python3 watermark_alpha_numba.py --file=biden.jpg --mask=watermark.png
image H:727 W:320, Channel:3
mask H:32 W:100, Channel:4
sec used: 1.818784
root@ubuntusrv:/usr/local/src/study/tmp# python3 watermark_alpha_numba.py --file=biden.jpg --mask=watermark.png
image H:727 W:320, Channel:3
mask H:32 W:100, Channel:4
sec used: 0.044184
root@ubuntusrv:/usr/local/src/study/tmp# python3 watermark_alpha.py --file=biden.jpg --mask=watermark.png
image H:727 W:320, Channel:3
mask H:32 W:100, Channel:4
sec used: 0.333172

With numba, the first run took a long time because the JIT compiler was working, but from the second run on it is quite fast. Compared with the version without numba, it is about 7.5 times faster (0.044184 versus 0.333172 seconds).
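The ratios quoted in this post can be recomputed from the measured times. The sketch below only copies the timings printed in the runs above; the variable names and the speedup helper are mine.

```python
# Timings (seconds) copied from the runs shown in this post.
c_loop = 0.019571
python_loop = 0.950677
numba_loop_cached = 0.046993
python_watermark = 0.333172
numba_watermark_cached = 0.044184

def speedup(slow, fast):
    """How many times faster the `fast` timing is than the `slow` one."""
    return slow / fast

print(f"C vs Python loop        : {speedup(python_loop, c_loop):.1f}x")
print(f"numba vs Python loop    : {speedup(python_loop, numba_loop_cached):.1f}x")
print(f"C vs numba loop         : {speedup(numba_loop_cached, c_loop):.1f}x")
print(f"numba vs Python overlay : {speedup(python_watermark, numba_watermark_cached):.1f}x")
```

These are single measurements on one machine, so the ratios are only indicative.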
Both programs produce the same watermarked image.
Wrapping Up
If you apply JIT compilation using numba to functions that use a lot of loop statements in Python code, you can see a significant performance improvement.
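However, numba is not applicable in all cases. In nopython mode it supports only a subset of Python and numpy, so code that relies on unsupported features, such as the %-formatted print above, has to be rewritten or left uncompiled.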
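Where numba cannot be applied, one alternative is to express the pixel loop as numpy array operations. The following is a minimal sketch of the same alpha blend written that way. It is not the code from the earlier articles: the function name blend_alpha_vectorized is hypothetical, it assumes the mask lies entirely inside the base image, and I have not benchmarked it against the numba version.

```python
import numpy as np

def blend_alpha_vectorized(base, mask, x, y):
    """Alpha-blend a BGRA mask onto a BGR base image at (x, y) without loops.

    Assumes the mask lies entirely inside the base image.
    """
    h, w = mask.shape[:2]
    img = base.copy()
    region = img[y:y + h, x:x + w].astype(np.float64)

    color = mask[:, :, :3].astype(np.float64)            # B, G, R planes
    alpha = mask[:, :, 3:4].astype(np.float64) / 255.0   # shape (h, w, 1)

    # Same formula as the per-pixel loop, applied to whole arrays at once.
    blended = color * alpha + region * (1.0 - alpha)
    img[y:y + h, x:x + w] = blended.astype(np.uint8)     # truncates like int()
    return img

# Tiny synthetic example (random data, not the images used in this post).
rng = np.random.default_rng(0)
base = rng.integers(0, 256, size=(8, 10, 3), dtype=np.uint8)
mask = rng.integers(0, 256, size=(4, 5, 4), dtype=np.uint8)
result = blend_alpha_vectorized(base, mask, x=2, y=1)
print(result.shape, result.dtype)
```

The alpha plane keeps a trailing axis of length 1 so that it broadcasts across the three color channels. Pixels with alpha equal to 0 need no special case, because the formula leaves the background value unchanged.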
You can refer to the Numba documentation to learn a wide range of numba features and uses.