A duplicate image finder built on import

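My situation

Ever since I discovered Pixiv in high school, I have been collecting pictures of anime girls. By now I have nearly 7,000 of them, taking up almost 20 GB of disk space, and some of them are duplicates. For example:

The artist posted on Twitter first and I saved a copy, then posted on Pixiv and I saved another copy
The artist posted on Pixiv and I saved a copy, the platform took it down for some reason, the artist reposted it, and I saved it again

Finding the duplicates by eye is basically impossible, and I don't really want to use an online deduplication service, so I decided to glue a tool together with the world's strongest glue.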

How it works

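Hashing the files directly with something like SHA-256 definitely won't work, because metadata, image format, image dimensions, the way sensitive parts are censored, and the artist's signature can all change the hash. So I went with the industry classic, pHash. According to ChatGPT, it roughly works by shrinking the picture to a very small grayscale image, running some damn discrete cosine transform over it, and keeping only the low-frequency part as a compact fingerprint. In Python you can just use the imagehash library.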
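To make the idea a bit more concrete, here is a toy sketch of a DCT-based perceptual hash in pure Python. It is not imagehash's code: it skips the resize and grayscale conversion and starts from a 32x32 grid of brightness values, and the names dct_lowfreq, toy_phash and hamming are made up for illustration. The "images" are just seeded random noise.

```python
import math
import random
import statistics


def dct_lowfreq(pixels, keep):
    """Return the top-left keep x keep block of the 2D DCT-II of a square grid."""
    n = len(pixels)
    # 1D cosine basis for the `keep` lowest frequencies.
    basis = [
        [math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
        for u in range(keep)
    ]
    block = []
    for u in range(keep):
        row = []
        for v in range(keep):
            total = 0.0
            for y in range(n):
                for x in range(n):
                    total += pixels[y][x] * basis[u][y] * basis[v][x]
            row.append(total)
        block.append(row)
    return block


def toy_phash(pixels, keep=8):
    """One bit per low-frequency coefficient: 1 if it is above the median."""
    coeffs = [c for row in dct_lowfreq(pixels, keep) for c in row]
    median = statistics.median(coeffs)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | int(c > median)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


rng = random.Random(0)
original = [[rng.randrange(256) for _ in range(32)] for _ in range(32)]
brighter = [[p + 10 for p in row] for row in original]  # same picture, brighter
unrelated = [[rng.randrange(256) for _ in range(32)] for _ in range(32)]

near = hamming(toy_phash(original), toy_phash(brighter))
far = hamming(toy_phash(original), toy_phash(unrelated))
print("same picture, brighter:", near)
print("unrelated picture:", far)
```

A uniform brightness change only touches the constant (DC) coefficient, so the two hashes of the same picture stay close, while a byte-level hash like SHA-256 would change completely.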

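Once every image is reduced to something hash-like, how do I find the duplicates? imagehash can compute the Hamming distance between two hashes, but comparing every pair is O(n²), which sucks. ChatGPT told me there is a data structure called a BK-tree that is designed for exactly this scenario, and in Python you can just use the pybktree library.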
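For anyone curious what a BK-tree actually does, here is a minimal pure-Python sketch. It is only meant to show the idea and is not how pybktree is implemented; TinyBKTree and hamming are hypothetical names, and the hashes are plain 4-bit integers instead of image hashes.

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


class TinyBKTree:
    """Each node is (item, children), with children keyed by distance to the item."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (item, {})
                return
            node = child  # same distance already taken, descend

    def find(self, item, radius):
        """Return sorted (distance, item) pairs within `radius` of `item`."""
        found = []
        stack = [self.root] if self.root is not None else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(item, candidate)
            if d <= radius:
                found.append((d, candidate))
            # Triangle inequality: matches can only live in subtrees
            # whose key lies in [d - radius, d + radius].
            for key, child in children.items():
                if d - radius <= key <= d + radius:
                    stack.append(child)
        return sorted(found)


tree = TinyBKTree(hamming)
for h in [0b0000, 0b0001, 0b0011, 0b1111, 0b1000]:
    tree.add(h)

result = tree.find(0b0000, 1)
print(result)  # [(0, 0), (1, 1), (1, 8)]
```

The pruning step is the whole point: subtrees whose key falls outside the window are never visited, so a query usually touches far fewer hashes than a full scan would.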

For persistence I just use sqlite3, which performs better than a JSON file in pretty much every respect.
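One thing sqlite3 makes easy is skipping files that were already hashed. The sketch below is a variation I am assuming for illustration, not the schema used in the script: it adds a UNIQUE constraint on filename and relies on INSERT OR IGNORE, uses an in-memory database, and save_hash is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE hash ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " filename TEXT NOT NULL UNIQUE,"  # assumption: not in the original schema
    " hash TEXT NOT NULL)"
)


def save_hash(filename, hex_hash):
    """Insert a row unless the filename is already stored; return True if inserted."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO hash (filename, hash) VALUES (?, ?)",
        (filename, hex_hash),
    )
    return cur.rowcount == 1


inserted_first = save_hash("a.png", "ffff0000")
inserted_again = save_hash("a.png", "ffff0000")  # ignored, already present
save_hash("b.png", "0000ffff")
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM hash").fetchone()[0]
print(inserted_first, inserted_again, row_count)
```

The UNIQUE constraint also gives the filename column an index, so the "have I seen this file" check does not have to scan the whole table.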

Implementation

My Python implementation:

#!/usr/bin/env python
# coding: utf-8

import logging
import os
import sqlite3

import imagehash
import magic
import matplotlib.pyplot as plt
from PIL import Image
from pybktree import BKTree

logging.basicConfig(level=logging.INFO)

DB_FILENAME = os.path.expanduser("~/Documents/pixiv.image.hash.db")
DB_NAME = "hash"
CREATE_SQL = """CREATE TABLE hash (
  id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
  filename TEXT NOT NULL,
  hash TEXT NOT NULL
)
"""
IMG_DIR = "/mnt/hdd43/imgs/pixiv"
PREVIEW_DIR = os.path.expanduser("~/Pictures/dups")

file_magic = magic.Magic(mime=True)
sqlite3_conn = sqlite3.connect(DB_FILENAME)


def ensure_table() -> None:
    cur = sqlite3_conn.cursor()
    if (
        cur.execute(
            "SELECT name FROM sqlite_master WHERE name = ?", (DB_NAME,)
        ).fetchone()
        is not None
    ):
        logging.info(f"table {DB_NAME} exists.")
        return
    cur.execute(CREATE_SQL)
    logging.info(f"create table {DB_NAME}.")


def is_image(filename: str) -> bool:
    return file_magic.from_file(filename=filename).startswith("image")


def in_db(filename: str) -> bool:
    cur = sqlite3_conn.cursor()
    if (
        cur.execute("SELECT id FROM hash WHERE filename = ?", (filename,)).fetchone()
        is None
    ):
        return False
    return True


def cal_hash_recursively(directory: str = ".") -> None:
    for dirpath, _, filenames in os.walk(directory):
        for filename in filenames:
            f = os.path.normpath(os.path.join(dirpath, filename))
            logging.info(f"processing {f}...")
            if not is_image(f):
                logging.info(f"{f}: not an image, skipping ...")
                continue
            if in_db(f):
                logging.info(f"{f}: already in db, skipping ...")
                continue
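            # Open the image in a with-block so the file handle is closed
            # right after hashing instead of lingering across thousands of files.
            with Image.open(f) as img:
                img_hash = str(imagehash.phash(img, hash_size=16))
            cur = sqlite3_conn.cursor()
            cur.execute(
                "INSERT INTO hash VALUES (?, ?, ?)",
                (None, f, img_hash),
            )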
            logging.info(f"{f}: hash saved")
    sqlite3_conn.commit()


def img_hamming(a: tuple, b: tuple):
    return a[1] - b[1]


def construct_bktree() -> BKTree:
    cur = sqlite3_conn.cursor()
    cur.execute("SELECT filename, hash FROM hash")
    row = cur.fetchone()
    hashes = []
    while row is not None:
        hashes.append((row[0], imagehash.hex_to_hash(row[1])))
        row = cur.fetchone()
    return BKTree(img_hamming, hashes)


def find_dups(tree: BKTree) -> list[list]:
    cur = sqlite3_conn.cursor()
    cur.execute("SELECT filename, hash FROM hash")
    row = cur.fetchone()
    dups = []
    while row is not None:
        dup = [i[1][0] for i in tree.find((row[0], imagehash.hex_to_hash(row[1])), 25)]
        if len(dup) > 1:
            dups.append(dup)
        row = cur.fetchone()
    dups_sorted = [list(t) for t in set(tuple(sorted(l)) for l in dups)]
    return dups_sorted


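def generate_dup_previews(dups: list[list]) -> None:
    # Make sure the output directory exists before saving any figure.
    os.makedirs(PREVIEW_DIR, exist_ok=True)
    for i, dup in enumerate(dups):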
        images = dup
        titles = [os.path.basename(img) for img in images]
        logging.info(f"processing dup #{i}: {images}")
        fig, axes = plt.subplots(
            nrows=1, ncols=len(titles), figsize=(3 * len(titles), 3)
        )
        axes = axes.flatten()
        for j, ax in enumerate(axes):
            # Using matplotlib's imread directly here blows up memory, no idea why
            with Image.open(images[j]) as img:
                img.thumbnail((1024, 1024))
                ax.imshow(img)
            ax.set_title(titles[j], fontsize=8)
            ax.axis("off")
        fig.tight_layout()
        fig.savefig(os.path.join(PREVIEW_DIR, f"dup.{i}.png"), dpi=300)
        plt.close(fig)


ensure_table()
cal_hash_recursively(IMG_DIR)
tree = construct_bktree()
dups = find_dups(tree)
generate_dup_previews(dups)

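Shortcomings found so far:

Single-threaded, so it is painfully slow
The "deduplication" of the duplicate results themselves is poor. For example, querying a finds that a, b, c are duplicates, while querying b only finds a, b (because of the threshold), so a, b, c and a, b get counted as two separate groups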
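One possible fix for the second point, which the script above does not do, is to merge overlapping groups with a union-find structure. This is a minimal sketch under that assumption; merge_groups is a hypothetical helper that would take the list returned by find_dups.

```python
def merge_groups(groups):
    """Merge groups that share at least one member into a single group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for group in groups:
        if not group:
            continue
        first = find(group[0])
        for other in group[1:]:
            parent[find(other)] = first  # attach the other root to the first

    merged = {}
    for member in list(parent):
        merged.setdefault(find(member), []).append(member)
    return sorted(sorted(members) for members in merged.values())


groups = [["a", "b", "c"], ["a", "b"], ["d", "e"]]
print(merge_groups(groups))  # [['a', 'b', 'c'], ['d', 'e']]
```

The trade-off is that merging is transitive: if a matches b and b matches c, all three end up in one group even when a and c are further apart than the threshold. For a tool where I eyeball the previews anyway, that seems acceptable.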

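Results

There really isn't much I can show

Cute cat: the artist posted this on Twitter and on Pixiv at different times, so I ended up saving it twice

Blue devil: much the same as the previous one. There are also a few cases where the Pixiv ID changed, but those are really not suitable for showing.

Overall, I'm quite happy with it.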

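Ramblings

Until my third year of university, Pixiv was still a very high-quality site. The artists there were a mixed bag, but you could tell every one of them loved illustration, and every piece carried its own style, its own thinking, and its own effort. Compare that with today: a pile of "AI artists" producing 💩 at high speed, 血も涙もない (no blood, no tears). Every bit they output is a virus as far as my storage is concerned. The ones who properly tag their work as AI-generated are fine, since their work is easy to block. The truly hateful ones are the many "real* artists" who post their AI crap like diarrhea without ever tagging it as AI-generated, advertise themselves everywhere, and take paid commissions. I genuinely want to curse them out. Even more hateful is that Pixiv itself does nothing at all about this behavior: I have reported ten thousand accounts and not a single one has been dealt with so far.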

RayAlto's Blog
