AI - A simple way to collect your deep learning image dataset

From https://medium.com/analytics-vidhya/a-simple-way-to-collect-your-deep-learning-image-dataset-4ead47b6826c

Deep learning has become the go-to method for solving many challenging problems. As we know, with enough training, a deep network can segment an image and identify the “key points” in it.

A very simple mechanism, made large enough, can have a seemingly magical effect.

Therefore, for deep learning to work well, it needs a lot of data. In general, the more training data there is, the more accurate the model tends to be.

But where do we get all this data from? Well-annotated data can be both expensive and time-consuming to acquire, and hiring people to collect and label images by hand is not efficient. Yet in the deep learning era, data is arguably your most valuable resource.

Here, I show a simple way to collect your deep learning image dataset.

bing-images is a Python library that fetches image URLs from Bing.com and downloads the images using multithreading. It has the following features:

  • Support file type filters.
  • Support Bing.com filterui filters.
  • Download using multithreading and custom thread pool size.
  • Support purely obtaining the image URLs.

Demo

Create a demo project, called image-collector here.

Install bing-images

Requirements

❯ pip install bing-images
Collecting bing-images
  Downloading bing_images-0.0.6-py3-none-any.whl (6.7 kB)
Collecting requests>=2.24.0
  Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting selenium>=3.141.0
  Using cached selenium-3.141.0-py2.py3-none-any.whl (904 kB)
Collecting urllib3<1.27,>=1.21.1
  Using cached urllib3-1.26.3-py2.py3-none-any.whl (137 kB)
Requirement already satisfied: certifi>=2017.4.17 in /Users/catchzeng/miniconda3/envs/test/lib/python3.8/site-packages (from requests>=2.24.0->bing-images) (2020.12.5)
Collecting idna<3,>=2.5
  Using cached idna-2.10-py2.py3-none-any.whl (58 kB)
Collecting chardet<5,>=3.0.2
  Using cached chardet-4.0.0-py2.py3-none-any.whl (178 kB)
Installing collected packages: urllib3, idna, chardet, selenium, requests, bing-images
Successfully installed bing-images-0.1.0 chardet-4.0.0 idna-2.10 requests-2.25.1 selenium-3.141.0 urllib3-1.26.3
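
The install log above mentions both 0.0.6 and 0.1.0, so it can be worth confirming which version actually ended up in the environment. Here is a minimal sketch using only the standard library (Python 3.8+). The helper name installed_version is made up for this example and is not part of bing-images.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "bing-images" is the distribution name used in the pip command above.
    print(installed_version("bing-images"))
```

If this prints None, the package was probably installed into a different environment than the one running the script.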

Fetch image URLs

fetch_image_urls.py

from bing_images import bing

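# Fetch up to 10 PNG image URLs for "cat", restricted to square, black-and-white results.
urls = bing.fetch_image_urls("cat", limit=10, file_type='png', filters='+filterui:aspect-square+filterui:color2-bw')
print("{} images.".format(len(urls)))
# Print each URL with a 1-based index.
for counter, url in enumerate(urls, start=1):
    print("{}: {}".format(counter, url))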

Run

❯ python fetch_image_urls.py
10 images.
1: http://pngimg.com/uploads/cat/cat_PNG50521.png
2: http://pngimg.com/uploads/cat/cat_PNG1616.png
3: https://pngimg.com/uploads/cat/cat_PNG50532.png
4: https://pngimg.com/uploads/cat/cat_PNG1621.png
5: https://pngimg.com/uploads/cat/cat_PNG1618.png
6: http://pngimg.com/uploads/cat/cat_PNG1624.png
7: http://www.pngmart.com/files/5/Black-Cat-PNG-Transparent.png
8: http://www.myiconfinder.com/uploads/iconsets/256-256-a96249f4c8a9753fd904f8be023dc25c-cat.png
9: https://pngimg.com/uploads/cat/cat_PNG1619.png
10: http://pngimg.com/uploads/cat/cat_PNG50521.png
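
In the sample run above, entries 1 and 10 are the same URL, so the returned list can contain repeats. If that matters for your dataset, one option is to drop duplicates before downloading. The sketch below is plain Python, and the helper name dedupe_urls is hypothetical, not something bing-images provides. It compares exact strings only, so an http and an https variant of the same address still count as different.

```python
def dedupe_urls(urls):
    """Drop repeated URLs, keeping the first occurrence and the original order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))


sample = [
    "http://pngimg.com/uploads/cat/cat_PNG50521.png",
    "http://pngimg.com/uploads/cat/cat_PNG1616.png",
    "http://pngimg.com/uploads/cat/cat_PNG50521.png",
]
print(len(dedupe_urls(sample)))  # 2
```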

Download using multithreading

download.py

from bing_images import bing

# Download 20 PNG images of "cat" into output_dir, using a pool of 10 threads.
bing.download_images("cat",
                     20,
                     output_dir="/Users/catchzeng/Desktop/cat",
                     pool_size=10,
                     file_type="png",
                     force_replace=True)

Run

❯ python download.py
Save path: /Users/catchzeng/Desktop/cat
Downloading images
#1 http://pngimg.com/uploads/cat/cat_PNG50509.png Downloaded
#2 https://pngimg.com/uploads/cat/cat_PNG50498.png Downloaded
#3 http://www.freepngimg.com/download/cat/22193-3-adorable-cat.png Downloaded
#4 http://pngimg.com/uploads/cat/cat_PNG106.png Downloaded
#5 https://pngimg.com/uploads/cat/cat_PNG50465.png Downloaded
#6 https://pngimg.com/uploads/cat/cat_PNG50417.png Downloaded
#7 https://pngimg.com/uploads/cat/cat_PNG50480.png Downloaded
#8 http://pngimg.com/uploads/cat/cat_PNG119.png Downloaded
#9 https://pngimg.com/uploads/cat/cat_PNG50438.png Downloaded
#10 http://pngimg.com/uploads/cat/cat_PNG100.png Downloaded
#11 https://pngimg.com/uploads/cat/cat_PNG50447.png Downloaded
#12 https://pngimg.com/uploads/cat/cat_PNG50440.png Downloaded
#13 https://pngimg.com/uploads/cat/cat_PNG50433.png Downloaded
#14 https://www.pngarts.com/files/1/Baby-Cat-PNG-Free-Download.png Downloaded
#15 https://cdn.pixabay.com/photo/2017/02/22/16/55/cat-2089916_960_720.png Downloaded
#16 https://pngimg.com/uploads/cat/cat_PNG50434.png Downloaded
#17 http://pngimg.com/uploads/cat/cat_PNG50529.png Downloaded
#18 http://pngimg.com/uploads/cat/cat_PNG113.png Downloaded
#19 https://purepng.com/public/uploads/large/purepng.com-catanimalscat-981524673949tj5ns.png Downloaded
#20 https://pngimg.com/uploads/cat/cat_PNG50435.png Downloaded
Renaming images
Finished renaming
Done
Elapsed time: 20.76s
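
A "Downloaded" line does not by itself guarantee that the saved file is a usable image, since a URL ending in .png may return something else. As a rough sanity check before training, one could look at the first bytes of each file. The sketch below uses only the standard library. The function names are hypothetical, and it checks the 8-byte PNG signature only, not whether the whole file decodes.

```python
from pathlib import Path

# Every valid PNG file starts with these 8 bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png(path):
    """Return True if the file starts with the PNG signature."""
    with open(path, "rb") as f:
        return f.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE


def find_invalid_pngs(directory):
    """Return the files in `directory` that do not start with the PNG signature."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and not is_png(p)
    )
```

Calling find_invalid_pngs on the output directory used above would list the files worth inspecting or deleting by hand.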

Download square black-and-white images

download-square.py

from bing_images import bing

# Download 20 square, black-and-white PNG images of "cat", using a pool of 20 threads.
bing.download_images("cat",
                     20,
                     output_dir="/Users/catchzeng/Desktop/cat",
                     pool_size=20,
                     file_type="png",
                     filters='+filterui:aspect-square+filterui:color2-bw',
                     force_replace=True)
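
A real dataset usually has more than one class, and the filter string is easy to mistype. The sketch below shows one possible way to wrap the calls above. Both helpers (build_filters and collect_dataset) are hypothetical names, not part of bing-images, and the "dog" label and "dataset" folder in the commented example are made up. The download function is passed in as an argument, on the assumption that it takes the same arguments as bing.download_images in the examples above, so the loop can be tried without network access.

```python
import os


def build_filters(*names):
    """Join filterui names into the string format used in the examples above."""
    return "".join("+filterui:" + name for name in names)


def collect_dataset(labels, root_dir, per_class, download, **options):
    """Call `download` once per label, saving into root_dir/<label>."""
    for label in labels:
        out_dir = os.path.join(root_dir, label)
        # Create the class folder up front rather than relying on the downloader.
        os.makedirs(out_dir, exist_ok=True)
        download(label, per_class, output_dir=out_dir, **options)


# Example call (needs bing-images installed and network access):
# from bing_images import bing
# collect_dataset(["cat", "dog"], "dataset", 20, bing.download_images,
#                 pool_size=10, file_type="png",
#                 filters=build_filters("aspect-square", "color2-bw"),
#                 force_replace=True)
```

One folder per label is a layout that many training pipelines can read directly, which is why the sketch uses the label as the folder name.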

The detailed code is at https://github.com/CatchZeng/bing_images. See you!


Written by CatchZeng
AI (Machine Learning) and DevOps enthusiast.