Teren-app/DEDUPLICATION_PLAN_V2.md
Simon Pocrnjič dea7432deb changes
2025-12-26 22:39:58 +01:00


# V2 Deduplication Implementation Plan
## Problem Statement
Currently, ImportServiceV2 creates duplicate Person records and related entities when an import row contains Person data and either of the following already exists in the database:
1. A ClientCase with the same `client_ref`
2. A Contract with the same `reference` for the client

The duplication occurs because V2 doesn't check for existing entities before creating the Person and its related entities (addresses, phones, emails, activities).
## V1 Deduplication Strategy (Analysis)
### V1 Person Resolution Order (Lines 913-1015)
V1 follows this hierarchical lookup before creating a new Person:
1. **Contract Reference Lookup** (Lines 913-922)
   - If contract.reference exists → Find existing Contract → Get ClientCase → Get Person
   - Prevents creating new Person when Contract already exists
2. **Account Result Derivation** (Lines 924-936)
   - If Account processing resolved/created a Contract → Get ClientCase → Get Person
3. **ClientCase.client_ref Lookup** (Lines 937-945)
   - If client_ref exists → Find ClientCase by (client_id, client_ref) → Get Person
   - Prevents creating new Person when ClientCase already exists
4. **Contact Values Lookup** (Lines 949-964)
   - Check Email.value → Get Person
   - Check PersonPhone.nu → Get Person
   - Check PersonAddress.address → Get Person
5. **Person Identifiers Lookup** (Lines 1005-1007)
   - Check tax_number, ssn, etc. via `findPersonIdByIdentifiers()`
6. **Create New Person** (Lines 1009-1011)
   - Only if all above fail
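
The lookup order above amounts to a first-match-wins cascade. The following is a minimal, framework-free sketch of that idea, not V1's actual code: `resolvePersonId()` and the three sample resolvers are hypothetical names, and plain arrays stand in for the database lookups.

```php
<?php
// Hypothetical sketch: run resolvers in priority order; the first non-null ID wins.
// A null result is the only case in which a new Person should be created.
function resolvePersonId(array $resolvers, array $row): ?int
{
    foreach ($resolvers as $resolver) {
        $personId = $resolver($row);
        if ($personId !== null) {
            return $personId;
        }
    }

    return null;
}

// Stand-in lookup tables replacing the database (illustrative data only).
$personByContractRef = ['REF-123' => 7];
$personByClientRef   = ['B387055' => 9];
$personByEmail       = ['john@test.com' => 9];

// Ordered from most to least specific, mirroring steps 1, 3 and 4 above.
$resolvers = [
    'contract_reference' => fn (array $row): ?int => $personByContractRef[$row['reference'] ?? ''] ?? null,
    'client_ref'         => fn (array $row): ?int => $personByClientRef[$row['client_ref'] ?? ''] ?? null,
    'email'              => fn (array $row): ?int => $personByEmail[$row['email'] ?? ''] ?? null,
];

// The contract reference matches first, so client_ref is never consulted.
$found = resolvePersonId($resolvers, ['reference' => 'REF-123', 'client_ref' => 'B387055']);

// Nothing matches, so the caller would fall through to "Create New Person".
$missing = resolvePersonId($resolvers, ['client_ref' => 'NEW-001']);
```

Keeping the steps as an ordered list makes the precedence explicit: reordering the array changes which existing Person wins when several lookups would match.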
### V1 Contract Deduplication (Lines 2158-2196)
**Early Contract Lookup** (Lines 2168-2180):
```php
// Try to find existing contract EARLY by (client_id, reference)
// across all cases to prevent duplicates
$existing = Contract::query()->withTrashed()
    ->join('client_cases', 'contracts.client_case_id', '=', 'client_cases.id')
    ->where('client_cases.client_id', $clientId)
    ->where('contracts.reference', $reference)
    ->select('contracts.*')
    ->first();
```
**ClientCase Reuse Logic** (Lines 2214-2228):
```php
// If we have a client and client_ref, try to reuse existing case
// to avoid creating extra persons
if ($clientId && $clientRef) {
    $cc = ClientCase::where('client_id', $clientId)
        ->where('client_ref', $clientRef)
        ->first();
    if ($cc) {
        // Reuse this case
        $clientCaseId = $cc->id;
        // If case has no person yet, set it
        if (!$cc->person_id) {
            // Find or create person and attach
        }
    }
}
```
### Key V1 Design Principles
- **Resolution before Creation** - Always check for existing entities first
- **Chain Derivation** - Contract → ClientCase → Person (reuse existing chain)
- **Contact Deduplication** - Match by email/phone/address before creating
- **Client-Scoped Lookups** - All queries scoped to import.client_id
- **Minimal Person Creation** - Only create Person as last resort
## V2 Current Architecture Issues
### Problem Areas
1. **PersonHandler** (`app/Services/Import/Handlers/PersonHandler.php`)
   - Currently only deduplicates by tax_number/ssn (Lines 38-58)
   - Doesn't check if Person exists via Contract/ClientCase
   - Processes independently without context awareness
2. **ClientCaseHandler** (`app/Services/Import/Handlers/ClientCaseHandler.php`)
   - Correctly resolves by client_ref (Lines 16-27)
   - But doesn't prevent PersonHandler from running afterwards
3. **ContractHandler** (`app/Services/Import/Handlers/ContractHandler.php`)
   - Missing early resolution logic
   - Doesn't derive Person from existing Contract chain
4. **Processing Order Issue**
   - Current priority: Person(100) → ClientCase(95) → Contract(90)
   - Person runs BEFORE we know if ClientCase/Contract exists
   - Should be reversed: Contract → ClientCase → Person
## V2 Deduplication Plan
### Phase 1: Reverse Processing Order ✅
**Change entity priorities in database seeder:**
```php
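// NEW ORDER (descending priority).
// Illustrative map only; the seeder's actual structure may differ.
$priorities = [
    'Contract'   => 100,
    'ClientCase' => 95,
    'Person'     => 90,
    'Email'      => 80,
    'Address'    => 70,
    'Phone'      => 60,
    'Account'    => 50,
    'Payment'    => 40,
    'Activity'   => 30,
];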
```
**Rationale:** Process high-level entities first (Contract, ClientCase) so we can derive Person from existing chains.
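
To make the effect of the priority change concrete, here is a small sketch that derives the processing order from a priority map. It assumes handlers run in descending priority; `sortByPriority()` is a hypothetical helper, and the root keys are borrowed from the `$context` keys used later in this plan.

```php
<?php
// Hypothetical helper: return entity roots ordered by descending priority.
function sortByPriority(array $priorities): array
{
    // arsort() sorts by value, highest first, and keeps the keys attached.
    arsort($priorities);

    return array_keys($priorities);
}

// Current order: Person runs before anything is known about the case or contract.
$old = ['person' => 100, 'client_case' => 95, 'contract' => 90];

// Planned order: Contract and ClientCase run first so Person can be derived from them.
$new = ['contract' => 100, 'client_case' => 95, 'person' => 90];

$oldOrder = sortByPriority($old);
$newOrder = sortByPriority($new);
```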
### Phase 2: Early Resolution Service 🔧
**Create:** `app/Services/Import/EntityResolutionService.php`
This service will be called BEFORE handlers process entities:
```php
class EntityResolutionService
{
    /**
     * Resolve Person ID from import context (existing entities).
     * Returns Person ID if found, null otherwise.
     */
    public function resolvePersonFromContext(
        Import $import,
        array $mapped,
        array $context
    ): ?int {
        // 1. Check if Contract already processed
        if ($contract = $context['contract']['entity'] ?? null) {
            $personId = $this->getPersonFromContract($contract);
            if ($personId) return $personId;
        }

        // 2. Check if ClientCase already processed
        if ($clientCase = $context['client_case']['entity'] ?? null) {
            if ($clientCase->person_id) {
                return $clientCase->person_id;
            }
        }

        // 3. Check for existing Contract by reference
        if ($contractRef = $mapped['contract']['reference'] ?? null) {
            $personId = $this->getPersonFromContractReference(
                $import->client_id,
                $contractRef
            );
            if ($personId) return $personId;
        }

        // 4. Check for existing ClientCase by client_ref
        if ($clientRef = $mapped['client_case']['client_ref'] ?? null) {
            $personId = $this->getPersonFromClientRef(
                $import->client_id,
                $clientRef
            );
            if ($personId) return $personId;
        }

        // 5. Check for existing Person by contact values
        $personId = $this->resolvePersonByContacts($mapped);
        if ($personId) return $personId;

        return null; // No existing Person found
    }

    /**
     * Check if ClientCase exists for this client_ref.
     */
    public function clientCaseExists(int $clientId, string $clientRef): bool
    {
        return ClientCase::where('client_id', $clientId)
            ->where('client_ref', $clientRef)
            ->exists();
    }

    /**
     * Check if Contract exists for this reference.
     */
    public function contractExists(int $clientId, string $reference): bool
    {
        return Contract::query()
            ->join('client_cases', 'contracts.client_case_id', '=', 'client_cases.id')
            ->where('client_cases.client_id', $clientId)
            ->where('contracts.reference', $reference)
            ->exists();
    }

    private function getPersonFromContract(Contract $contract): ?int
    {
        if ($contract->client_case_id) {
            return ClientCase::where('id', $contract->client_case_id)
                ->value('person_id');
        }

        return null;
    }

    private function getPersonFromContractReference(
        ?int $clientId,
        string $reference
    ): ?int {
        if (!$clientId) return null;

        $clientCaseId = Contract::query()
            ->join('client_cases', 'contracts.client_case_id', '=', 'client_cases.id')
            ->where('client_cases.client_id', $clientId)
            ->where('contracts.reference', $reference)
            ->value('contracts.client_case_id');

        if ($clientCaseId) {
            return ClientCase::where('id', $clientCaseId)
                ->value('person_id');
        }

        return null;
    }

    private function getPersonFromClientRef(
        ?int $clientId,
        string $clientRef
    ): ?int {
        if (!$clientId) return null;

        return ClientCase::where('client_id', $clientId)
            ->where('client_ref', $clientRef)
            ->value('person_id');
    }

    /**
     * Match a Person by email, phone, or address (first hit wins).
     * Note: as written, these lookups are not scoped to a client,
     * unlike the Contract and ClientCase lookups above.
     */
    private function resolvePersonByContacts(array $mapped): ?int
    {
        // Check email
        if ($email = $mapped['email']['value'] ?? $mapped['emails'][0]['value'] ?? null) {
            $personId = Email::where('value', trim($email))->value('person_id');
            if ($personId) return $personId;
        }

        // Check phone
        if ($phone = $mapped['phone']['nu'] ?? $mapped['person_phones'][0]['nu'] ?? null) {
            $personId = PersonPhone::where('nu', trim($phone))->value('person_id');
            if ($personId) return $personId;
        }

        // Check address
        if ($address = $mapped['address']['address'] ?? $mapped['person_addresses'][0]['address'] ?? null) {
            $personId = PersonAddress::where('address', trim($address))->value('person_id');
            if ($personId) return $personId;
        }

        return null;
    }
}
```
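
The contact lookups in the service compare values after `trim()` only, so `John@Test.com` and `john@test.com` would count as different emails. One possible refinement, sketched under assumptions: `normalizeEmail()` and `normalizePhone()` are hypothetical helpers, emails are assumed to compare case-insensitively, and stored values would need the same normalization for the match to work.

```php
<?php
// Hypothetical helper: lower-case and trim an email before comparing it.
function normalizeEmail(string $email): string
{
    return strtolower(trim($email));
}

// Hypothetical helper: keep digits only, preserving a leading "+" country prefix.
function normalizePhone(string $phone): string
{
    $trimmed = trim($phone);
    $prefix  = str_starts_with($trimmed, '+') ? '+' : '';

    return $prefix . preg_replace('/\D+/', '', $trimmed);
}

$email = normalizeEmail('  John@Test.COM ');
$phone = normalizePhone(' +386 (0)41 123-456 ');
$local = normalizePhone('041 123 456');
```

Whether such normalization is appropriate depends on how existing Email and PersonPhone values were stored. Applying it only to the incoming side would reduce matches rather than increase them.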
### Phase 3: Update PersonHandler 🔧
**Modify:** `app/Services/Import/Handlers/PersonHandler.php`
Add resolution service check before creating:
```php
public function process(Import $import, array $mapped, array $raw, array $context = []): array
{
    $mode = $this->getOption('update_mode', 'update');

    // FIRST: Check if Person already resolved from context.
    // The service is read from $context (injected in Phase 6), so removing
    // that injection disables this step.
    $resolutionService = $context['resolution_service'] ?? null;
    $existingPersonId = $resolutionService?->resolvePersonFromContext(
        $import,
        $mapped,
        $context
    );

    // Guard against an ID whose Person row no longer exists
    $existing = $existingPersonId ? Person::find($existingPersonId) : null;

    if ($existing) {
        if ($mode === 'skip') {
            return [
                'action' => 'skipped',
                'entity' => $existing,
                'message' => 'Person already exists (resolved from Contract/ClientCase chain or contacts)',
            ];
        }

        // Update logic...

        return [
            'action' => 'updated',
            'entity' => $existing,
            'count' => 1,
        ];
    }

    // SECOND: Try existing deduplication (tax_number, ssn)
    $existing = $this->resolve($mapped, $context);

    if ($existing) {
        if ($mode === 'skip') {
            return [
                'action' => 'skipped',
                'entity' => $existing,
                'message' => 'Person already exists (matched by identifiers)',
            ];
        }

        // Update logic...

        // Return here so execution never falls through to the create step
        return [
            'action' => 'updated',
            'entity' => $existing,
            'count' => 1,
        ];
    }

    // Contact deduplication needs no separate step: resolvePersonFromContext()
    // already checks email/phone/address (its step 5), and
    // resolvePersonByContacts() is private to the service.

    // LAST: Create new Person only if all checks failed
    $payload = $this->buildPayload($mapped);
    $person = Person::create($payload);

    return [
        'action' => 'inserted',
        'entity' => $person,
        'count' => 1,
    ];
}
```
### Phase 4: Update ContractHandler 🔧
**Modify:** `app/Services/Import/Handlers/ContractHandler.php`
Add early Contract lookup and ClientCase reuse:
```php
public function process(Import $import, array $mapped, array $raw, array $context = []): array
{
    $clientId = $import->client_id;
    $reference = $mapped['reference'] ?? null;

    if (!$clientId || !$reference) {
        return [
            'action' => 'invalid',
            'errors' => ['Contract requires client_id and reference'],
        ];
    }

    // EARLY LOOKUP: Check if Contract exists across all cases
    $existing = Contract::query()
        ->join('client_cases', 'contracts.client_case_id', '=', 'client_cases.id')
        ->where('client_cases.client_id', $clientId)
        ->where('contracts.reference', $reference)
        ->select('contracts.*')
        ->first();

    if ($existing) {
        // Contract exists - update or skip
        $mode = $this->getOption('update_mode', 'update');

        if ($mode === 'skip') {
            return [
                'action' => 'skipped',
                'entity' => $existing,
                'message' => 'Contract already exists',
            ];
        }

        // Update logic...

        return [
            'action' => 'updated',
            'entity' => $existing,
            'count' => 1,
        ];
    }

    // Creating new Contract - resolve/create ClientCase
    $clientCaseId = $this->resolveOrCreateClientCase($import, $mapped, $context);

    if (!$clientCaseId) {
        return [
            'action' => 'invalid',
            'errors' => ['Unable to resolve client_case_id'],
        ];
    }

    // Create Contract
    $payload = array_merge($this->buildPayload($mapped), [
        'client_case_id' => $clientCaseId,
    ]);
    $contract = Contract::create($payload);

    return [
        'action' => 'inserted',
        'entity' => $contract,
        'count' => 1,
    ];
}

protected function resolveOrCreateClientCase(
    Import $import,
    array $mapped,
    array $context
): ?int {
    $clientId = $import->client_id;
    $clientRef = $mapped['client_ref'] ??
        $context['client_case']['entity']?->client_ref ??
        null;

    // If ClientCase already processed in this row
    if ($clientCaseId = $context['client_case']['entity']?->id ?? null) {
        return $clientCaseId;
    }

    // Try to find existing ClientCase by client_ref
    if ($clientRef) {
        $existing = ClientCase::where('client_id', $clientId)
            ->where('client_ref', $clientRef)
            ->first();

        if ($existing) {
            // REUSE existing ClientCase (and its Person)
            return $existing->id;
        }
    }

    // Create new ClientCase. With the Phase 1 order, Contract (100) runs
    // before Person (90), so the context normally holds no Person yet.
    $personId = $context['person']['entity']?->id ?? null;
    if (!$personId) {
        // Create a minimal placeholder Person; PersonHandler can later
        // resolve it via the Contract → ClientCase chain and update it
        $personId = Person::create(['type_id' => 1])->id;
    }

    $clientCase = ClientCase::create([
        'client_id' => $clientId,
        'person_id' => $personId,
        'client_ref' => $clientRef,
    ]);

    return $clientCase->id;
}
```
### Phase 5: Update ClientCaseHandler 🔧
**Modify:** `app/Services/Import/Handlers/ClientCaseHandler.php`
Ensure it uses resolved Person from context:
```php
public function process(Import $import, array $mapped, array $raw, array $context = []): array
{
    $clientId = $import->client_id ?? null;
    $clientRef = $mapped['client_ref'] ?? null;

    // Get Person from context if an earlier handler resolved one.
    // With the Phase 1 order, ClientCase (95) runs before Person (90),
    // so this is usually null at this point.
    $personId = $context['person']['entity']?->id ?? null;

    if (!$clientId) {
        return [
            'action' => 'skipped',
            'message' => 'ClientCase requires client_id',
        ];
    }

    $existing = $this->resolve($mapped, $context);

    if ($existing) {
        $mode = $this->getOption('update_mode', 'update');

        if ($mode === 'skip') {
            return [
                'action' => 'skipped',
                'entity' => $existing,
                'message' => 'ClientCase already exists (skip mode)',
            ];
        }

        $payload = $this->buildPayload($mapped, $existing);

        // Update person_id ONLY if provided and different
        if ($personId && $existing->person_id !== $personId) {
            $payload['person_id'] = $personId;
        }

        $appliedFields = $this->trackAppliedFields($existing, $payload);
        $existing->update($payload);

        return [
            'action' => 'updated',
            'entity' => $existing,
            'count' => 1,
        ];
    }

    // Create new ClientCase
    $payload = $this->buildPayload($mapped);

    // Attach Person if resolved
    if ($personId) {
        $payload['person_id'] = $personId;
    }

    $payload['client_id'] = $clientId;
    $clientCase = ClientCase::create($payload);

    return [
        'action' => 'inserted',
        'entity' => $clientCase,
        'count' => 1,
    ];
}
```
### Phase 6: Integration into ImportServiceV2 🔧
**Modify:** `app/Services/Import/ImportServiceV2.php`
Inject resolution service into processRow:
```php
protected function processRow(Import $import, array $mapped, array $raw, array $context): array
{
    $entityResults = [];
    $lastEntityType = null;
    $lastEntityId = null;
    $hasErrors = false;

    // NEW: Add resolution service to context
    $context['resolution_service'] = app(EntityResolutionService::class);

    // Process entities in configured priority order
    foreach ($this->entityConfigs as $root => $config) {
        // ... existing logic ...
    }

    // ... rest of method ...
}
```
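
The handlers in Phases 3 to 5 read earlier results through keys such as `$context['contract']['entity']`. The sketch below illustrates that accumulation pattern under assumptions: `processRowSketch()` and the closure-based handlers are hypothetical stand-ins, and the real `processRow()` loop (elided above) may store results differently.

```php
<?php
// Hypothetical sketch: run handlers in priority order and store each result
// under its entity root so later handlers can read what earlier ones resolved.
function processRowSketch(array $handlers, array $mapped): array
{
    $context = [];

    foreach ($handlers as $root => $handler) {
        $context[$root] = $handler($mapped[$root] ?? [], $context);
    }

    return $context;
}

// Stand-in handlers; plain objects replace Eloquent models (illustrative data only).
$handlers = [
    'contract' => fn (array $data, array $ctx): array => [
        'action' => 'updated',
        'entity' => (object) ['id' => 5, 'client_case_id' => 3],
    ],
    // Derives its case from the Contract that was resolved first.
    'client_case' => fn (array $data, array $ctx): array => [
        'action' => 'updated',
        'entity' => (object) ['id' => $ctx['contract']['entity']->client_case_id, 'person_id' => 9],
    ],
    // Derives the Person from the case instead of creating a new one.
    'person' => fn (array $data, array $ctx): array => [
        'action' => 'updated',
        'entity' => (object) ['id' => $ctx['client_case']['entity']->person_id],
    ],
];

$context = processRowSketch($handlers, ['person' => ['name' => 'John']]);
```

Because each handler only sees results from handlers that ran before it, the Contract → ClientCase → Person order is what lets the Person step reuse ID 9 here.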
## Implementation Checklist
### Step 1: Update Database Priority ✅
- [ ] Modify `database/seeders/ImportEntitiesV2Seeder.php`
- [ ] Change priorities: Contract(100), ClientCase(95), Person(90)
- [ ] Run seeder: `php artisan db:seed --class=ImportEntitiesV2Seeder --force`
### Step 2: Create EntityResolutionService 🔧
- [ ] Create `app/Services/Import/EntityResolutionService.php`
- [ ] Implement all resolution methods
- [ ] Add comprehensive PHPDoc
- [ ] Add logging for debugging
### Step 3: Update PersonHandler 🔧
- [ ] Modify `process()` method to check resolution service first
- [ ] Add contact-based deduplication
- [ ] Ensure proper skip/update modes
### Step 4: Update ContractHandler 🔧
- [ ] Add early Contract lookup (client_id + reference)
- [ ] Implement ClientCase reuse logic
- [ ] Prevent duplicate Contract creation
### Step 5: Update ClientCaseHandler 🔧
- [ ] Use Person from context
- [ ] Handle person_id properly on updates
- [ ] Maintain existing deduplication
### Step 6: Integrate into ImportServiceV2 🔧
- [ ] Add resolution service to context
- [ ] Test with existing imports
### Step 7: Testing 🧪
- [ ] Test import with existing client_ref
- [ ] Test import with existing contract reference
- [ ] Test import with existing email/phone
- [ ] Test mixed scenarios
- [ ] Verify no duplicate Persons created
- [ ] Check all related entities linked correctly
## Expected Behavior After Implementation
### Scenario 1: Existing ClientCase by client_ref
```
Import Row: {client_ref: "B387055", name: "John", email: "john@test.com"}
Before V2 Fix:
❌ Creates new Person (duplicate)
❌ Creates new Email (duplicate)
✅ Reuses ClientCase
After V2 Fix:
✅ Finds existing Person via ClientCase
✅ Updates Person if needed
✅ Reuses ClientCase
✅ Reuses/updates Email
```
### Scenario 2: Existing Contract by reference
```
Import Row: {contract.reference: "REF-123", person.name: "Jane"}
Before V2 Fix:
❌ Creates new Person (duplicate)
❌ Contract might be created or updated
❌ New Person not linked to existing ClientCase
After V2 Fix:
✅ Finds existing Contract
✅ Derives Person from Contract → ClientCase chain
✅ Updates Person if needed
✅ No duplicate Person created
```
### Scenario 3: New Import (no existing entities)
```
Import Row: {client_ref: "NEW-001", name: "Bob"}
Behavior:
✅ Creates new Person
✅ Creates new ClientCase
✅ Links correctly
✅ No duplicates
```
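
Scenarios 1 and 3 can be reproduced with a small in-memory model, which may be a useful starting point for the tests in Step 7. This is a hedged sketch only: `importRowSketch()` and the `$store` layout are hypothetical and replace the real handlers and database.

```php
<?php
// Hypothetical in-memory model: reuse the Person behind an existing
// (client_id, client_ref) pair, otherwise create a Person and a case.
function importRowSketch(array &$store, int $clientId, array $row): string
{
    $key = $clientId . ':' . $row['client_ref'];

    if (isset($store['cases'][$key])) {
        // Scenario 1: the case exists, so update its Person instead of duplicating it.
        $personId = $store['cases'][$key];
        $store['persons'][$personId] = array_merge(
            $store['persons'][$personId],
            ['name' => $row['name']]
        );

        return 'updated';
    }

    // Scenario 3: nothing exists yet, so create the Person and link the case to it.
    $personId = count($store['persons']) + 1;
    $store['persons'][$personId] = ['name' => $row['name']];
    $store['cases'][$key] = $personId;

    return 'inserted';
}

$store = ['persons' => [], 'cases' => []];

$first  = importRowSketch($store, 1, ['client_ref' => 'NEW-001', 'name' => 'Bob']);
$second = importRowSketch($store, 1, ['client_ref' => 'NEW-001', 'name' => 'Robert']);

// The same client_ref under a different client is a different case (client-scoped lookup).
$other  = importRowSketch($store, 2, ['client_ref' => 'NEW-001', 'name' => 'Bob']);
```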
## Success Criteria
- **No duplicate Persons** when client_ref or contract reference exists
- **Proper entity linking** - all entities connected to correct Person
- **Backward compatibility** - existing imports still work
- **Skip mode respected** - handlers honor skip/update modes
- **Contact deduplication** - matches by email/phone/address
- **Performance maintained** - no significant slowdown
## Rollback Plan
If issues occur:
1. Revert priority changes in database
2. Disable EntityResolutionService by commenting out context injection
3. Fall back to original handler behavior
4. Investigate and fix issues
5. Re-implement with fixes
## Notes
- This plan maintains V2's modular handler architecture
- Resolution logic is centralized in EntityResolutionService
- Handlers remain independent but context-aware
- Similar to V1 but cleaner separation of concerns
- Can be implemented incrementally (phase by phase)
- Each phase can be tested independently