The terms Data cardinality and Selectivity are quite similar in their operations and many novice SQL developers confuse between these terms. Selectivity can be defined as the ratio between number of rows and cardinality of the table. The selectivity which returns the most post possible rows can be termed as “bad selectivity” whilst the selectivity which returns the lowest possible value is called as “good selectivity”. In every table, the primary key returns with the highest cardinality so it’s ratio with number of rows is called as the “best possible selectivity”.
How to Calculate Selectivity in a Database?
Consider that we have a table which contains the contact details of employees who worked in a company. The table will contains fields like contact_no, dept_id, employee_design_role, date_work_started and date_work_ended. The following keys will be used in our table
- contact_no
- dept_id
- employee_design_role
- date_work_started
- date_work_ended
Consider that you want to select the list of employees with designation role as manager. Before opting for selectivity statements; we must mention the fields and query for the selectivity. Our SQL will be similar to the one given below.
FROM
contact list
INNER JOIN job table
ON
(
list.is_deleted = 0
AND
list.id = j.contact_no
AND
table.is_deleted = 0
AND
table.contact_no = #FORM.contact_id#
AND
table.employee_design_role = #ContactRoles.Manager#
AND
table.date_work_started <= NOW()
AND
(
table.date_work_ended IS NULL
OR
table.date_work_ended > NOW()
)
)
is_deleted is used in both sides of the join because database uses logical delete instead of physical. Let us assume that the table contains 100 contact_no and 150 jobs. The ratio between contact_no and jobs is 1:1.5. Now let’s exert our choices of cardinality and selectivity.
is_deleted
is_deleted is a boolean operation which returns either 1 or 0. The cardinality of this particular field is 2. Since we have around 100 contact_no; the selectivity rate of this particular field is 100/2 = 50. The sample size jobs is 150 whose selectivity rate is 150/2 = 75.
contact_no
contact_no is a foreign key to our particular table because the selectivity rate of this field will be similar to the ratio between contact_no and jobs. We have around 100 contact_no and number of sample size jobs is around 150. The cardinality of this field is 100 while the number of records present is 150 so the selectivity rate for this field is 150/100 = 1.5.
employee_design_role
Consider that there are 15 types of jobs in the company. Now the cardinality for employee_design_role is 15. This field will give us the selectivity rate of 150/15 = 10. The selectivity rate of employee_design_role is better than is_deleted but worse when compared to contact_no.
date_work_started
Let us assume that no jobs are allowed add in the table which will assign a date for every work started. The date set for every work is stored as a binary value which returns the cardinality of the date_work_started as 2. The selectivity rate of date_work_started is 150/2 = 75.
date_work_ended
Similar to the date_work_started; the date will return with binary value. If the work is still continuing it will return with NULL value. So the cardinality rate of date_work_ended is 3. The selectivity rate for date_work_ended is 150/3 = 3.
Based on the selectivity rate for the above fields; the joins in our SQL can be re-arranged like the code given below.
FROM
contact list
INNER JOIN
job table
ON
(
<!— Selectivity: 1. —>
c.no = #FORM.contact_no#
<!— Selectivity: 10. —>
AND
table.employee_design_role = #ContactRoles.Manager#
<!— Selectivity: 3. —>
AND
(
table.date_work_ended IS NULL
OR
table.date_work_ended > NOW()
)
<!— Selectivity: 50. —>
AND
list.is_deleted = 0
<!— Selectivity: 75. —>
AND
table.is_deleted = 0
<!— Selectivity: 75. —>
AND
table.date_work_started <= NOW()
<!— Join condition. —>
AND
list.no = table.contact_no
)
The above SQL code will offer you better advantage while indexing the table. If you are searching in a column which has 100% selectivity then it is better to use a query in that particular column instead using selectivity which prompts full table scan. Selectivity will be the best option while searching for data with repeated values.