A Supervised Learning Recommendation Framework for Linking (Big) Data
The project is funded by the DFG as part of the Infrastructure Priority Program "New Data Spaces for the Social Sciences" (SPP 2431) under grant 539465691.
| (index) | Model (alphanumeric) | Producer (alphanumeric) | Origin (alphanumeric) | Sales (in millions) (numeric) |
|---|---|---|---|---|
| 1 | Model T | Ford | USA | 16.5 |
| 2 | Model A | Ford | USA | 4.8 |
| 3 | Beetle | Volkswagen | Germany | 21.5 |

| (index) | Name (alphanumeric) | Firm (alphanumeric) | Country (alphanumeric) | Engine (in liters) (numeric) |
|---|---|---|---|---|
| 1 | T Model | Ford | United States | 2.9 |
| 2 | Corolla | Toyota | Japan | 1.8 |
| 3 | Beetle | Volkswagen | Germany | 1.6 |
| 4 | Mdl 124 | Fiat | Italy | 1.4 |

A record matching toy example with two sources.
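To make the toy example concrete, the following minimal sketch compares every record of the first source with every record of the second and keeps the best-scoring candidate. It is not the framework's method: the field associations in `FIELD_PAIRS`, the unweighted averaging, and the use of `difflib.SequenceMatcher` as the similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Toy sources from the tables above (numeric fields omitted).
left = [
    {"Model": "Model T", "Producer": "Ford", "Origin": "USA"},
    {"Model": "Model A", "Producer": "Ford", "Origin": "USA"},
    {"Model": "Beetle", "Producer": "Volkswagen", "Origin": "Germany"},
]
right = [
    {"Name": "T Model", "Firm": "Ford", "Country": "United States"},
    {"Name": "Corolla", "Firm": "Toyota", "Country": "Japan"},
    {"Name": "Beetle", "Firm": "Volkswagen", "Country": "Germany"},
    {"Name": "Mdl 124", "Firm": "Fiat", "Country": "Italy"},
]

# Hypothetical field associations (left field, right field).
FIELD_PAIRS = [("Model", "Name"), ("Producer", "Firm"), ("Origin", "Country")]


def pair_score(l, r):
    """Unweighted average of character-level similarities over FIELD_PAIRS."""
    scores = [SequenceMatcher(None, l[a], r[b]).ratio() for a, b in FIELD_PAIRS]
    return sum(scores) / len(scores)


def best_matches(left, right):
    """For each left record, return (index of best right record, its score)."""
    result = []
    for l in left:
        score, j = max((pair_score(l, r), j) for j, r in enumerate(right))
        result.append((j, score))
    return result


matches = best_matches(left, right)
for i, (j, score) in enumerate(matches):
    print(left[i]["Model"], "->", right[j]["Name"], round(score, 2))
```

Under these assumptions, Model A is also assigned to T Model even though it has no counterpart in the second source. Picking the best score is therefore not enough on its own; some decision rule has to reject such pairs.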
Levenshtein (1965) similarity(Model T, Mdl 124) = 0.8
Hamming (1950) similarity(Model T, Mdl 124) = 0.75
Levenshtein (1965) similarity(Model T, T Model) = 0.75
Token sort ratio(Model T, T Model) = 1
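The contrast between a plain edit-distance similarity and a token sort ratio can be sketched in a few lines of Python. The normalization used here (one minus the distance divided by the longer string's length) is an assumption. Libraries normalize differently, so the plain similarity computed this way need not coincide with the values quoted above. The token sort result for a pure reordering of words is 1 under any such normalization.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) ca
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def token_sort_similarity(a: str, b: str) -> float:
    """Sort whitespace-separated tokens, then compare the rejoined strings."""
    def normalize(s: str) -> str:
        return " ".join(sorted(s.split()))
    return levenshtein_similarity(normalize(a), normalize(b))


plain = levenshtein_similarity("Model T", "T Model")
sorted_tokens = token_sort_similarity("Model T", "T Model")
print(plain, sorted_tokens)
```

Sorting the tokens makes the comparison insensitive to word order, which is why the reordered pair scores 1 while the plain similarity stays below 1.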




| Task | Dataset | Domain | Left Fields | Left #Records | Right Fields | Right #Records | #Matches | Rel. | Dirty |
|---|---|---|---|---|---|---|---|---|---|
| (B1) | DBLP-ACM | Bibliographic | 4 (0) | 2,614 | 4 (0) | 2,294 | 2,224 | 1:1 | |
| (B2) | Abt-Buy | E-commerce | 3 (1) | 1,081 | 4 (2) | 1,092 | 1,097 | m:1 | yes |
| (B3) | Amazon-GoogleProducts | E-commerce | 4 (2) | 1,363 | 4 (1) | 3,226 | 1,300 | m:1 | yes |

| EM System | Source | F-score DBLP-ACM | F-score Abt-Buy | F-score Amazon-GoogleProducts |
|---|---|---|---|---|
| Magellan | Mudgal et al. (2018) | 98.4 | 43.6 | 49.1 |
| DeepER | Ebraheem et al. (2018) | 96.0 | 98.6 | |
| DeepMatcher | Mudgal et al. (2018) | 98.4 | 62.8 | 69.3 |
| Ditto | Li et al. (2021) | 99.0 | 75.6 | |
| AdaMEL-hyb | Jin et al. (2021) | 98.9 | 65.1 | |
| RuleSynth | Singh et al. (2017) | 92.6 | 63.8 | |
| CorDEL | Wang et al. (2020) | 99.2 | 64.9 | 70.2 |
| AutoFJ | Li et al. (2021) | 97.7 | 61.3 | |
| ZeroER | Wu et al. (2020) | 96.0 | 52.0 | 48.0 |
| MLMatch | This Article | 99.8 | 76.6 | 83.6 |
| MLMatch Rank | | 1. | 1. | 2. |










| (1) Iteration | (2) TP | (3) FP | (4) TN | (5) FN | (6) Accuracy | (7) Precision | (8) Recall | (9) F-Score |
|---|---|---|---|---|---|---|---|---|
| 1 | 256 | 0 | 6430 | 2 | 99.97 | 100 | 99.22 | 99.61 |
| 2 | 253 | 0 | 6430 | 5 | 99.93 | 100 | 98.06 | 99.02 |
| 3 | 256 | 2 | 6428 | 2 | 99.94 | 99.22 | 99.22 | 99.22 |
| 4 | 257 | 0 | 6430 | 1 | 99.99 | 100 | 99.61 | 99.81 |
| 5 | 258 | 4 | 6426 | 0 | 99.94 | 98.47 | 100 | 99.23 |
| Average | 256 | 1.2 | 6428.8 | 2 | 99.95 | 99.53 | 99.22 | 99.38 |

| (1) Iteration | (2) TP | (3) FP | (4) TN | (5) FN | (6) Accuracy | (7) Precision | (8) Recall | (9) F-Score |
|---|---|---|---|---|---|---|---|---|
| 1 | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100 |
| 2 | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100 |
| 3 | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100 |
| 4 | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100 |
| 5 | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100 |
| Average | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100 |
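Columns (6) to (9) follow from the confusion counts in columns (2) to (5) through the standard definitions of accuracy, precision, recall, and F-score. The sketch below recomputes them in percent; the function name and the convention of returning 0 when a denominator is zero are choices made for this illustration.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F-score in percent, rounded to 2 digits."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "accuracy": round(100 * accuracy, 2),
        "precision": round(100 * precision, 2),
        "recall": round(100 * recall, 2),
        "f_score": round(100 * f_score, 2),
    }


# Iteration 1 of the first table: TP=256, FP=0, TN=6430, FN=2.
first = classification_metrics(256, 0, 6430, 2)
# Any iteration of the second table: TP=54, FP=0, TN=266, FN=0.
second = classification_metrics(54, 0, 266, 0)
print(first)
print(second)
```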
```python
import tensorflow

# `match`, `my_custom_awesome_similarity`, `evaluation_metrics`, and
# `load_train_data` are assumed to be imported or defined elsewhere.

# Similarities to compute for each field association.
similarity_map = {
    "company_name": [
        "discrete",
        "partial",
        my_custom_awesome_similarity,
    ],
    "address~address1": ["partial"],
    "address~address2": ["partial"],
    "purpose": [
        "sort",
        lambda x, y: x*y + 0.42 - y*x,
    ],
    "foundation": [
        "discrete",
        "partial",
    ],
}

model = match.MatchingModel(similarity_map)
model.compile(
    loss="binary_crossentropy",
    optimizer=tensorflow.keras.optimizers.Adam(learning_rate=0.01),
    metrics=evaluation_metrics,
)

train_left, train_right, train_matches = load_train_data()
model.fit(train_left, train_right, train_matches, epochs=100)
model.evaluate(train_left, train_right, train_matches)
predictions = model.predict(train_left, train_right)
suggestions = model.suggest(train_left, train_right, 3)
```

```r
# `matching_model`, `my_custom_awesome_similarity`, `evaluation_metrics`,
# `load_train_data`, and the test and prediction data sets are assumed to be
# defined elsewhere.

# Similarities to compute for each field association.
similarity_map <- list(
  company_name = c(
    "discrete",
    "partial",
    my_custom_awesome_similarity
  ),
  `address~address1` = c("partial"),
  `address~address2` = c("partial"),
  purpose = c(
    "sort",
    function(x, y) x*y + 0.42 - y*x
  ),
  foundation = c(
    "discrete",
    "partial"
  )
)

model <- matching_model(similarity_map)
model |> compile(
  loss = keras::loss_binary_crossentropy(),
  optimizer = keras::optimizer_adam(learning_rate = 1e-3),
  metrics = evaluation_metrics
)

# R has no tuple unpacking; `load_train_data()` is assumed to return a list
# with the elements `left`, `right`, and `matches`.
train <- load_train_data()
model |> fit(train$left, train$right, train$matches, epochs = 100L)
model |> evaluate(left_test, right_test, matches_test)
predictions <- model |> predict(left, right)
suggestions <- model |> suggest(left, right, count = 3)
```

| Left Field | Right Field | Similarities | Ratios |
|---|---|---|---|
| title | title | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| authors | authors | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| venue | venue | Levenshtein, Jaro-Winkler, discrete | partial, token sort, token set, partial token set, not missing |
| year | year | Euclidean, Gaussian | |

| Left Field | Right Field | Similarities | Ratios |
|---|---|---|---|
| description | description | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| name | name | Levenshtein, Jaro-Winkler, discrete | partial, token sort, token set, partial token set |
| description | name | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| name | description | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| price | price | Levenshtein, Jaro-Winkler, discrete | partial, token sort, token set, partial token set |
| name | manufacturer | | partial, partial token set |

| Left Field | Right Field | Similarities | Ratios |
|---|---|---|---|
| description | manufacturer | | partial, partial token set |
| description | description | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set, not missing |
| title | name | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set, not missing |
| description | name | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set, not missing |
| title | description | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set, not missing |
| manufacturer | manufacturer | Levenshtein, Jaro-Winkler, discrete | partial, token sort, token set, partial token set, not missing |
| price | price | Levenshtein, Jaro-Winkler, discrete | partial, token sort, token set, partial token set, not missing |

| Left Field | Right Field | Similarities | Ratios |
|---|---|---|---|
| company name | company name | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| company info 1 | company info 1 | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| company info 2 | company info 2 | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| found date | found date | discrete | |
| found year | found year | discrete | |
| register date | register date | discrete | |
| register year | register year | discrete | |
| concession date | concession date | discrete | |
| concession year | concession year | discrete | |
| statute change date | statute change date | discrete | |
| company name | company info 1 | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| company name | company info 2 | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| company info 1 | company info 2 | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |

| Left Field | Right Field | Similarities | Ratios |
|---|---|---|---|
| main info | main info | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| Vorstand | Vorstand | Levenshtein, Jaro-Winkler | |
| StVdAR | StVdAR | Levenshtein, Jaro-Winkler | |
| GeschF | GeschF | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| Leiter | Leiter | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| Beirat | Beirat | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| AR | AR | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| name | name | Levenshtein, Jaro-Winkler, discrete | |
| surname | surname | Levenshtein, Jaro-Winkler, discrete | |
| occupation | occupation | Levenshtein, Jaro-Winkler, discrete | |
| address | address | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| birth date | birth date | discrete | |
| raw text | raw text | | token set, partial token set |
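Tables of this kind can be turned into a dictionary shaped like the `similarity_map` in the code listing. The sketch below assumes that a key is the bare field name when both sides share it and `left~right` otherwise, which is inferred from keys such as `"address~address1"` in the listing. The helper `build_similarity_map` and the similarity identifiers (`"levenshtein"`, `"jaro_winkler"`, and so on) are hypothetical and may not be the names the library accepts.

```python
def build_similarity_map(rows):
    """Build a dict keyed by field association from (left, right, names) rows.

    Assumed key convention: "field" if both sides use the same name,
    otherwise "left~right".
    """
    similarity_map = {}
    for left, right, names in rows:
        key = left if left == right else f"{left}~{right}"
        # Merge repeated associations instead of overwriting them.
        similarity_map.setdefault(key, []).extend(names)
    return similarity_map


# A few rows transcribed from the tables above; identifiers are hypothetical.
rows = [
    ("company name", "company name", ["levenshtein", "jaro_winkler", "partial"]),
    ("company name", "company info 1", ["levenshtein", "jaro_winkler", "partial"]),
    ("birth date", "birth date", ["discrete"]),
    ("raw text", "raw text", ["token_set", "partial_token_set"]),
]

similarity_map = build_similarity_map(rows)
for key, names in similarity_map.items():
    print(key, names)
```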