Introduction
Data classification is the foundational step in any compliance programme under the Digital Personal Data Protection Act, 2023 (DPDPA). Before an organisation can implement consent mechanisms, enforce retention policies, respond to Data Principal rights requests, or report breaches, it must first know what personal data it holds, where it resides, and how sensitive it is. The DPDPA defines personal data broadly. Although the Act does not create a separate category of 'sensitive personal data' as the Personal Data Protection Bill, 2019 did, different types of personal data carry different risk profiles and require different levels of protection. A robust data classification framework enables organisations to apply proportionate controls, prioritise compliance efforts, and build the data inventory that underpins every other DPDPA obligation.
What Constitutes Personal Data Under the DPDPA
The DPDPA defines personal data as 'any data about an individual who is identifiable by or in relation to such data.' This is a deliberately broad definition that captures any information that can identify a specific individual, either directly or indirectly. Direct identifiers include names, photographs, email addresses, phone numbers, government-issued identity numbers (Aadhaar, PAN, passport), and biometric data. Indirect identifiers include data that, when combined with other information, can identify an individual - such as device identifiers, IP addresses, location data, browsing history, purchase patterns, and employee identification numbers. The definition also extends to inferred or derived data - personal profiles, credit scores, health risk assessments, and behavioural predictions - because these are data 'about' an identifiable individual even if not directly provided by them.
- Direct identifiers - name, photograph, email address, phone number, Aadhaar number, PAN, passport number
- Other government-issued identity data - voter ID, driving licence, ration card number
- Financial data - bank account numbers, credit card details, UPI IDs, transaction history, credit scores
- Health data - medical records, prescriptions, insurance claims, health conditions, diagnostic reports
- Biometric data - fingerprints, iris scans, facial recognition data, voice prints
- Location data - GPS coordinates, address history, travel patterns, IP-based geolocation
- Digital identifiers - device IDs, cookie identifiers, advertising IDs, browsing history, app usage patterns
- Employment data - employee ID, salary details, performance records, background verification results
The Sensitivity Spectrum
The DPDPA does not explicitly create a category of 'sensitive personal data', unlike the earlier Personal Data Protection Bill, 2019, which proposed one, and the Information Technology (Reasonable Security Practices and Procedures and Sensitive Personal Data or Information) Rules, 2011, which define one. Practical data classification must nevertheless account for varying levels of sensitivity, because not all personal data carries the same risk if compromised. The exposure of an individual's name in a breach has a very different impact from the exposure of their Aadhaar number, health records, or biometric data. A pragmatic classification framework for DPDPA compliance should define at least three tiers of sensitivity: general personal data (names, email addresses, phone numbers), sensitive personal data (financial data, health data, government identity numbers, biometric data), and highly sensitive personal data (children's data, data related to sexual orientation or religious beliefs, genetic data). Each tier should have corresponding controls for access, encryption, retention, and cross-border transfer that reflect its risk profile.
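One way to make the three tiers concrete is a small lookup that assigns each data category a tier and gives a record the highest tier among the categories it contains. The sketch below is illustrative only: the category names, the `Tier` enum, and the choice to treat unknown categories as sensitive are assumptions made for this example, not terms defined by the DPDPA.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical sensitivity tiers, ordered so that max() picks the strictest."""
    GENERAL = 1
    SENSITIVE = 2
    HIGHLY_SENSITIVE = 3


# Illustrative mapping that follows the examples given in the text above.
CATEGORY_TIER = {
    "name": Tier.GENERAL,
    "email": Tier.GENERAL,
    "phone": Tier.GENERAL,
    "financial": Tier.SENSITIVE,
    "health": Tier.SENSITIVE,
    "government_id": Tier.SENSITIVE,
    "biometric": Tier.SENSITIVE,
    "children": Tier.HIGHLY_SENSITIVE,
    "sexual_orientation": Tier.HIGHLY_SENSITIVE,
    "religious_belief": Tier.HIGHLY_SENSITIVE,
    "genetic": Tier.HIGHLY_SENSITIVE,
}


def record_tier(categories):
    """Return the highest tier among the categories present in a record.

    Unknown categories default to SENSITIVE (an assumption) so that they are
    reviewed rather than silently under-protected. An empty record is GENERAL.
    """
    tiers = (CATEGORY_TIER.get(c, Tier.SENSITIVE) for c in categories)
    return max(tiers, default=Tier.GENERAL)
```

Under this scheme a record holding a name and a health condition is handled as sensitive, because the strictest category present determines the controls.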
Building a Data Classification Framework
A data classification framework provides the structured approach for consistently categorising data across the organisation. The framework should define classification levels, criteria for each level, handling rules for each level, and the processes for classifying and reclassifying data.
- Define classification levels - at minimum: Public, Internal, Confidential, and Restricted. Map personal data sensitivity tiers to these levels
- Establish classification criteria - clear rules that determine which level applies. For personal data, criteria should include the type of identifier, the context of processing, and the potential impact of exposure
- Define handling rules per level - access controls, encryption requirements (aligned with standards such as ISO/IEC 27001), transmission protocols, storage locations, retention periods, and disposal methods appropriate to each classification level
- Assign classification responsibility - data owners classify data at the time of creation or collection. For existing data, a classification project is needed to retrospectively label data
- Create a classification labelling mechanism - technical metadata tags, document headers, database column annotations, or automated labels that make the classification visible to systems and users
- Establish reclassification procedures - processes for reviewing and updating classifications when the context of data use changes or when regulatory requirements evolve
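The framework elements above can be captured as data that systems read, not only as a policy document. The following is a minimal sketch under stated assumptions: the role names, retention periods, tier-to-level mapping, and the rule that downgrades need data-owner approval are all hypothetical values chosen for illustration, and each organisation would set its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandlingRule:
    encrypt_at_rest: bool
    allowed_roles: frozenset
    retention_days: int
    cross_border_allowed: bool


# Classification levels from least to most restrictive.
LEVELS = ["Public", "Internal", "Confidential", "Restricted"]

# Hypothetical handling rules per level; all values are placeholders.
HANDLING_RULES = {
    "Public": HandlingRule(False, frozenset({"employee", "analyst", "dpo"}), 3650, True),
    "Internal": HandlingRule(True, frozenset({"employee", "analyst", "dpo"}), 1825, True),
    "Confidential": HandlingRule(True, frozenset({"analyst", "dpo"}), 1095, False),
    "Restricted": HandlingRule(True, frozenset({"dpo"}), 365, False),
}

# Assumed mapping of personal data sensitivity tiers to classification levels.
TIER_TO_LEVEL = {
    "general": "Internal",
    "sensitive": "Confidential",
    "highly_sensitive": "Restricted",
}


def rule_for_tier(tier):
    """Look up the handling rule that applies to a sensitivity tier."""
    return HANDLING_RULES[TIER_TO_LEVEL[tier]]


def may_access(role, tier):
    """Check whether a role is allowed to access data of the given tier."""
    return role in rule_for_tier(tier).allowed_roles


def reclassify(current_level, new_level, approved_by_owner=False):
    """Return the new level; downgrades require data-owner approval (assumed policy)."""
    is_downgrade = LEVELS.index(new_level) < LEVELS.index(current_level)
    if is_downgrade and not approved_by_owner:
        raise PermissionError("downgrade requires data owner approval")
    return new_level
```

Keeping the rules in one table means an access check, an encryption job, and a retention job all read the same source, so a policy change is made once.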
Manual vs Automated Classification
Manual data classification - where humans review data and assign classification labels - is accurate for individual decisions but does not scale to modern data volumes. A large enterprise may create and process millions of data records daily across dozens of systems. Manual classification cannot keep pace, leaving unclassified data that becomes a compliance blind spot. Automated classification uses technology to scan, analyse, and label data based on predefined rules and patterns. Rule-based automation uses regular expressions and pattern matching to identify known personal data formats - Aadhaar numbers (12 digits, the last of which is a Verhoeff check digit), PAN format (5 letters, 4 digits, 1 letter), email formats, and phone number patterns. Machine learning-based automation goes further, using trained models to identify personal data in unstructured text, recognise context-dependent sensitivity, and handle ambiguous cases. The optimal approach combines automation with human oversight - automated classification handles the volume, while human reviewers validate edge cases and approve classifications for sensitive categories.
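The rule-based approach can be sketched in a few lines. The example below pairs regular expressions with a checksum so that an arbitrary 12-digit number is not reported as an Aadhaar number. It rests on assumptions: the patterns are simplified, the Aadhaar check assumes the commonly documented Verhoeff check digit and a first digit of 2 to 9, and `find_identifiers` is a hypothetical helper name, not part of any product or library.

```python
import re

# Verhoeff tables: multiplication in the dihedral group D5 and the permutation.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]


def verhoeff_valid(digits):
    """Return True if a digit string (check digit last) passes the Verhoeff check."""
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


# Simplified patterns; real deployments would need broader variants.
PATTERNS = {
    "PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "AADHAAR": re.compile(r"\b[2-9][0-9]{3}[ -]?[0-9]{4}[ -]?[0-9]{4}\b"),
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def find_identifiers(text):
    """Return (label, value) pairs for identifiers found in free text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group()
            if label == "AADHAAR":
                # Strip separators, then drop candidates that fail the checksum.
                value = re.sub(r"\D", "", value)
                if not verhoeff_valid(value):
                    continue
            hits.append((label, value))
    return hits
```

The checksum step is what keeps false positives manageable: order numbers and account references that merely happen to be 12 digits long are mostly rejected, while pattern-only matching would flag all of them for human review.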
Practical Classification Challenges
Implementing data classification in a real enterprise environment reveals several practical challenges that organisations must anticipate and address. Legacy systems often store personal data in unstructured formats - free-text fields, scanned documents, PDF attachments - that resist automated classification. Data spread across shadow IT systems, personal drives, and unsanctioned cloud services may not be visible to classification tools. Multilingual data is common in Indian enterprises, where personal data may be stored in Hindi, Tamil, Marathi, or other regional languages alongside English. Personal data embedded in images - photographs of identity documents, screenshots of forms, handwritten notes - requires optical character recognition (OCR) and computer vision capabilities. Data that moves between systems undergoes transformations - aggregation, anonymisation, pseudonymisation - that can change its classification status. Dynamic data environments where new applications and data sources are constantly being introduced require classification processes that are continuously active rather than one-time exercises.
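To illustrate how a transformation interacts with classification, the sketch below pseudonymises selected fields with a keyed hash (HMAC-SHA256 from the Python standard library) and relabels the record. The field names and the `_classification` label are hypothetical. The sketch also assumes that a pseudonymised record can still be linked back to an individual by whoever holds the key, so whether it moves to a lower classification level is a policy decision this example does not settle.

```python
import hashlib
import hmac


def pseudonymise(value, key):
    """Replace a value with a stable keyed token (HMAC-SHA256, hex encoded)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymise_record(record, identifier_fields, key):
    """Return a copy of the record with the given fields tokenised.

    The original record is left unchanged. The '_classification' label is a
    hypothetical marker that downstream tools could use to re-evaluate the
    record's classification level.
    """
    out = dict(record)
    for field in identifier_fields:
        if field in out:
            out[field] = pseudonymise(str(out[field]), key)
    out["_classification"] = "pseudonymised"
    return out
```

Because the same input and key always produce the same token, records can still be joined across systems after the transformation, which is also why the key itself needs to be treated as highly restricted.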
Classification and Data Principal Rights
Data classification directly enables the fulfilment of Data Principal rights under the DPDPA. When a Data Principal exercises their right to access, the organisation must be able to locate all personal data related to that individual across all systems - which requires that personal data has been identified and classified. The right to correction requires knowing where inaccurate data resides and propagating corrections across all instances. The right to erasure requires identifying every copy and derivative of the Data Principal's personal data for deletion or anonymisation. Without comprehensive classification, responding to these rights requests becomes a manual search across every system in the organisation - a process that is slow, incomplete, and exposes the organisation to regulatory risk for inadequate responses. Well-classified data with proper metadata enables automated, complete, and timely responses to Data Principal rights requests.
- Right to Access - classification enables rapid identification of all personal data related to a specific Data Principal
- Right to Correction - classification and data lineage ensure corrections propagate to all instances and derivatives
- Right to Erasure - classification identifies every copy for deletion, including backups and archived data
- Consent Management - classification links personal data to the consent under which it was collected and processed
- Breach Notification - classification enables rapid assessment of which personal data was compromised and which Data Principals are affected
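The link between classification metadata and rights requests can be shown with a small in-memory inventory. This is a sketch, not a reference design: `InventoryEntry`, `DataInventory`, and their fields are hypothetical names, and a production inventory would sit on a database and trigger deletion in the source systems instead of only updating an index.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryEntry:
    principal_id: str   # the Data Principal the element relates to
    system: str         # e.g. "crm", "billing"
    location: str       # table, path, or object key within the system
    category: str       # e.g. "email", "financial"
    consent_id: str     # consent under which the data was collected


class DataInventory:
    def __init__(self):
        self._by_principal = defaultdict(list)
        self._by_system = defaultdict(list)

    def add(self, entry):
        self._by_principal[entry.principal_id].append(entry)
        self._by_system[entry.system].append(entry)

    def locate(self, principal_id):
        """Right to access: every known element for one Data Principal."""
        return list(self._by_principal.get(principal_id, []))

    def affected_principals(self, system):
        """Breach assessment: whose data is held in the compromised system."""
        return {e.principal_id for e in self._by_system.get(system, [])}

    def erase(self, principal_id):
        """Right to erasure: drop index entries and return where deletion must happen."""
        entries = self._by_principal.pop(principal_id, [])
        for e in entries:
            self._by_system[e.system].remove(e)
        return [(e.system, e.location) for e in entries]
```

With such an index, an access request becomes a lookup by principal, and a breach in one system becomes a lookup by system, in place of a search across every data store.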
How Kraver.ai Automates Data Classification for DPDPA
Kraver.ai's AI-powered data classification engine is specifically built for the Indian data landscape. Our machine learning models are trained on Indian personal data formats, multilingual content (including Hindi, Tamil, Telugu, Kannada, and other regional languages), and the specific classification requirements of the DPDPA. The engine automatically scans structured databases, unstructured documents, cloud storage, email archives, and SaaS applications to identify and classify personal data. Computer vision capabilities extend classification to images and scanned documents, identifying Aadhaar cards, PAN cards, and other identity documents embedded in attachments and uploads. Each data element is tagged with its classification level, the Data Principal it relates to, the consent under which it was collected, and its retention status - creating a living data inventory that powers all downstream DPDPA compliance workflows. With Kraver.ai, data classification is not a one-time project but a continuously running, self-improving process that keeps your data inventory current as your business evolves.