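How do you evaluate clinical image quality with statistical methods?
SK course: Radiation physics, technology and radiation protection (Strålningsfysik, teknik och strålskydd), 9 Dec 2015
Örjan Smedby, School of Technology and Health (Skolan för teknik och hälsa, STH), KTH Royal Institute of Technology (Kungliga Tekniska Högskolan)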
Context (diagram): examination → diagnosis → choice of treatment → treatment effect → health effect
Efficacy of Diagnostic Methods
- Level 1: Technical efficacy. Image quality, resolution, noise...
- Level 2: Diagnostic accuracy efficacy. How often is the diagnosis correct?
- Level 3: Diagnostic thinking efficacy. How is the referring physician's diagnostic thinking affected?
- Level 4: Therapeutic efficacy. How is the choice of treatment affected?
- Level 5: Patient outcome efficacy. How is the patient's health affected?
- Level 6: Societal efficacy. Benefits and costs for society.
(Fryback DG, Thornbury JR. Med Decis Making 1991)
Image quality vs. diagnostic accuracy
- Evaluate the entire diagnostic process: reliable ground truth?
- Evaluate physical quality parameters: physical measuring tools, classical statistical tools
A diagnostic test
| | Positive test | Negative test | Total |
| Diseased | 25 | 5 | 30 |
| Healthy | 15 | 105 | 120 |
| Total | 40 | 110 | 150 |
How likely is it that a diseased person is classified correctly? Sensitivity = 25/30 = 83%
How likely is it that a healthy person is classified correctly? Specificity = 105/120 = 88%
How likely is it that a patient with a positive test is actually diseased? Positive predictive value = 25/40 = 63%
How likely is it that a patient with a negative test is actually healthy? Negative predictive value = 105/110 = 95%
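The four measures can be reproduced directly from the 2×2 table. This is a minimal sketch. The function name `accuracy_measures` and the tp/fn/fp/tn argument names are illustrative choices and do not come from the slides.

```python
# Sketch: the four accuracy measures from a 2x2 table (counts from the table above).

def accuracy_measures(tp, fn, fp, tn):
    """Return sensitivity, specificity, PPV and NPV from 2x2 counts.

    tp: diseased, positive test   fn: diseased, negative test
    fp: healthy, positive test    tn: healthy, negative test
    """
    return {
        "sensitivity": tp / (tp + fn),  # diseased classified correctly
        "specificity": tn / (tn + fp),  # healthy classified correctly
        "ppv": tp / (tp + fp),          # positive test -> truly diseased
        "npv": tn / (tn + fn),          # negative test -> truly healthy
    }

measures = accuracy_measures(tp=25, fn=5, fp=15, tn=105)
for name, value in measures.items():
    print(f"{name}: {value:.1%}")
```

With one decimal this prints 83.3%, 87.5%, 62.5% and 95.5%. Rounded, these give the slide values.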
Threshold level
Higher limit for pathology:
- sensitivity decreases
- specificity increases
Lower limit for pathology:
- sensitivity increases
- specificity decreases
(Figure: sensitivity and specificity as functions of the threshold.)
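The trade-off can be shown by sweeping a cut-off over test values. This is a sketch on made-up measurements; the data and the names `diseased`, `healthy` and `sens_spec` are hypothetical. A value at or above the cut-off counts as pathological.

```python
# Sketch with hypothetical data: sensitivity and specificity as the cut-off rises.
# A value >= threshold counts as "pathological" (positive test).

diseased = [4.1, 5.0, 5.6, 6.2, 6.8, 7.5, 8.1, 9.0]  # hypothetical measurements
healthy = [2.0, 2.9, 3.5, 4.0, 4.4, 5.2, 5.9, 6.5]   # hypothetical measurements

def sens_spec(threshold):
    sens = sum(x >= threshold for x in diseased) / len(diseased)
    spec = sum(x < threshold for x in healthy) / len(healthy)
    return sens, spec

thresholds = [3.0, 4.0, 5.0, 6.0, 7.0]
results = [sens_spec(t) for t in thresholds]
for t, (sens, spec) in zip(thresholds, results):
    print(f"threshold {t:.1f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Raising the threshold can only move cases from positive to negative. Sensitivity therefore never increases and specificity never decreases.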
ROC curve: sensitivity (0 to 1) plotted against 1 − specificity (0 to 1). Area under the ROC curve (AUROC): 1 = perfect, 0.5 = worthless.
Receiver operating characteristics
- A generalisation of sensitivity and specificity
- How do sensitivity and specificity change when the threshold changes?
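An empirical ROC curve tries every observed value as a cut-off. The area under it can be computed with the trapezoidal rule, and it equals the probability that a random diseased case scores higher than a random healthy one, with ties counted as one half. This is a sketch on hypothetical data, and the function names are illustrative.

```python
# Sketch: empirical ROC curve and its area (AUROC) for hypothetical data.
# x-axis: 1 - specificity (false-positive fraction); y-axis: sensitivity.

def roc_points(diseased, healthy):
    # Every observed value is tried as a cut-off, from strictest to most lenient;
    # a value >= cut-off counts as a positive test.
    cutoffs = sorted(set(diseased) | set(healthy), reverse=True)
    points = [(0.0, 0.0)]
    for c in cutoffs:
        fpf = sum(x >= c for x in healthy) / len(healthy)
        tpf = sum(x >= c for x in diseased) / len(diseased)
        points.append((fpf, tpf))
    return points  # the last point is (1.0, 1.0)

def auc_trapezoid(points):
    # Trapezoidal rule along the piecewise-linear ROC curve
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def auc_pairwise(diseased, healthy):
    # P(diseased > healthy) + 0.5 * P(tie), over all diseased/healthy pairs
    wins = sum((d > h) + 0.5 * (d == h) for d in diseased for h in healthy)
    return wins / (len(diseased) * len(healthy))

diseased = [4.1, 5.0, 5.6, 6.2, 6.8, 7.5, 8.1, 9.0]  # hypothetical
healthy = [2.0, 2.9, 3.5, 4.0, 4.4, 5.2, 5.9, 6.5]   # hypothetical
print(auc_trapezoid(roc_points(diseased, healthy)), auc_pairwise(diseased, healthy))
```

Both calculations give 54/64 ≈ 0.84 here. Identical distributions would give 0.5, the "worthless" diagonal.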
Thresholds and probabilities (figure): on a probability scale from 0% to 100% lie an exclusion threshold and an action threshold. A negative finding moves the pre-test probability (before the examination) down to a lower post-test probability (after the examination). A positive finding moves it up.
How the probability is affected by a positive or a negative finding can be calculated with likelihood ratios (LR+ and LR−), which depend on sensitivity and specificity. (See the Wikipedia article "Likelihood ratios in diagnostic testing".)
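The likelihood-ratio update is Bayes' theorem in odds form: post-test odds = pre-test odds × LR. This sketch uses sensitivity, specificity and prevalence from the 2×2 table. Taking the table's prevalence as the pre-test probability is an assumption made for illustration. The function names are hypothetical.

```python
# Sketch: updating a pre-test probability with likelihood ratios.

def likelihood_ratios(sens, spec):
    lr_pos = sens / (1 - spec)   # LR+: factor by which a positive finding multiplies the odds
    lr_neg = (1 - sens) / spec   # LR-: factor by which a negative finding multiplies the odds
    return lr_pos, lr_neg

def post_test_probability(pre_test_prob, lr):
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

sens, spec = 25 / 30, 105 / 120
pre = 30 / 150                                # prevalence in the table, used as pre-test probability
lr_pos, lr_neg = likelihood_ratios(sens, spec)
p_pos = post_test_probability(pre, lr_pos)    # probability of disease after a positive finding
p_neg = post_test_probability(pre, lr_neg)    # probability of disease after a negative finding
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
print(f"after positive finding: {p_pos:.3f}; after negative finding: {p_neg:.3f}")
```

Because the pre-test probability here equals the table's prevalence, the positive result reproduces the PPV (25/40). The negative result reproduces 1 − NPV (5/110). In a population with a different prevalence, the same LR+ and LR− give different post-test probabilities.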
Thresholds
The thresholds for treatment or for expectant management (watchful waiting) depend on the consequences of each decision.
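One common way to make "depends on the consequences" concrete is a standard expected-utility argument. Treat when p × benefit > (1 − p) × harm. That gives a threshold probability of harm / (harm + benefit). This sketch uses that textbook model, not anything specified on the slides. The harm and benefit numbers are hypothetical, on an arbitrary utility scale.

```python
# Sketch of a textbook expected-utility threshold (not from the slides).
# harm_if_healthy: net loss from treating a patient who is not diseased (hypothetical units)
# benefit_if_diseased: net gain from treating a patient who is diseased (hypothetical units)

def treatment_threshold(harm_if_healthy, benefit_if_diseased):
    """Probability of disease above which treating has the higher expected utility."""
    return harm_if_healthy / (harm_if_healthy + benefit_if_diseased)

def should_treat(p_disease, harm_if_healthy, benefit_if_diseased):
    expected_gain = p_disease * benefit_if_diseased
    expected_loss = (1 - p_disease) * harm_if_healthy
    return expected_gain > expected_loss

# Hypothetical mild treatment with large benefit: low threshold
print(treatment_threshold(harm_if_healthy=1, benefit_if_diseased=9))  # 0.1
# Hypothetical risky treatment with modest benefit: high threshold
print(treatment_threshold(harm_if_healthy=3, benefit_if_diseased=1))  # 0.75
```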
Receiver operating characteristics (continued)
- Requires a known true diagnosis (gold standard)
- Requires a large material (many cases)
- Much work, high costs
Image quality vs. diagnostic accuracy
- Evaluate the entire diagnostic process: reliable ground truth, ROC study
- Evaluate physical quality parameters: physical measuring tools, classical statistical tools
Image quality vs. diagnostic accuracy
- Evaluate the entire diagnostic process: reliable ground truth, ROC study
- Visual image quality concept: visual grading experiment?
- Evaluate physical quality parameters: physical measuring tools, classical statistical tools
Study types
- Single images: rate image A on a scale from 1 to 5
- Image pairs: rate the difference between images A and B on a scale from −2 to +2
Criteria & rating scale
European Commission: European Guidelines on Quality Criteria for Diagnostic Radiographic Images, EUR 16260 EN
(Slide image caption: "Let alone a MAGRITTE")
Criteria & rating scale
European guidelines on quality criteria. Typical criterion: visibility of an anatomical structure, e.g. "Visually sharp reproduction of the intervertebral joints".
5. Criterion is fulfilled
4. Criterion is probably fulfilled
3. Indecisive
2. Criterion is probably not fulfilled
1. Criterion is not fulfilled
Criteria & rating scale
European guidelines on quality criteria. Typical criterion: visibility of an anatomical structure, e.g. "Visually sharp reproduction of the intervertebral joints".
+2: Criterion is better fulfilled in right image
+1: Criterion is probably better fulfilled in right image
0: Indecisive
−1: Criterion is probably better fulfilled in left image
−2: Criterion is better fulfilled in left image
Situation
Each observer score depends on the patient (P1, P2, P3, P4, ...), the imaging (Im1, Im2), the postprocessing (PP1, PP2, PP3) and the observer (O1, O2, O3, ...).
Types of data
- Interval: numerical, continuous (e.g. a measurement: 10, 20, 30, 40)
- Ordinal: ordered categories (e.g. a rating score: 1, 2, 3, 4, 5)
- Nominal: individual categories, no order (e.g. persons: A, B, C, D)
Visual grading characteristics (VGC) (Båth & Månsson, BJR 2007)
For each quality level: what proportion of the images fulfils the requirement with method A and with method B, respectively?
(Figure: taken level by level, the cumulative proportions for method A and method B form a curve through the points (0.00, 0.00), (0.05, 0.20), (0.20, 0.50), (0.50, 0.80), (0.80, 0.95) and (1.00, 1.00).)
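The VGC idea can be sketched in the same way as an ROC curve. For each quality level, count the proportion of images scoring at that level or better with each method, then take the area under the resulting curve. This sketch uses hypothetical ratings on a 1–5 scale where higher is better. Placing method A on the x-axis is an orientation choice made here. The function names are illustrative, and this is not the authors' software.

```python
# Sketch: VGC curve and the area under it for hypothetical ratings.
# Scale 1-5, where a higher score means the criterion is better fulfilled.

def vgc_points(scores_a, scores_b, levels=(1, 2, 3, 4, 5)):
    """For each quality level (strictest first), the proportion of images
    scoring at that level or better with method A (x) and method B (y)."""
    points = [(0.0, 0.0)]
    for level in sorted(levels, reverse=True):
        x = sum(s >= level for s in scores_a) / len(scores_a)
        y = sum(s >= level for s in scores_b) / len(scores_b)
        points.append((x, y))
    return points

def area_under_curve(points):
    # Trapezoidal rule along the piecewise-linear curve
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def prob_b_rated_higher(scores_a, scores_b):
    # P(score_B > score_A) + 0.5 * P(tie) over all image pairs
    wins = sum((b > a) + 0.5 * (b == a) for a in scores_a for b in scores_b)
    return wins / (len(scores_a) * len(scores_b))

scores_a = [1, 2, 2, 3, 3, 3, 4, 4, 5, 2]  # hypothetical ratings, method A
scores_b = [2, 3, 3, 4, 4, 4, 5, 5, 5, 3]  # hypothetical ratings, method B
points = vgc_points(scores_a, scores_b)
print(points)
print(round(area_under_curve(points), 3), round(prob_b_rated_higher(scores_a, scores_b), 3))
```

Both numbers come out at 0.715 for these ratings. An area above 0.5 indicates that method B tends to be rated higher than method A. An area of 0.5 indicates no difference.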
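Statistical model (diagram)
The score (ordinal: 1–5) depends on:
- Patient (nominal)
- Imaging system? (nominal: Im1, Im2, Im3) or its settings (interval: e.g. 10, 20, 30, 40)
- Postprocessing (nominal)
- Observer (nominal)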
Statistical model: the score (ordinal) is related to patient, imaging system, settings (interval), postprocessing and observer by ordinal logistic regression.
Logistic regression
Logit function: $\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right)$
Regression equation: $\operatorname{logit}(p) = ax + b \quad\Longleftrightarrow\quad p = \frac{1}{1 + \exp(-(ax + b))}$
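The logit and its inverse can be checked numerically: applying one after the other returns the starting value. The coefficients `a` and `b` in this sketch are hypothetical.

```python
import math

def logit(p):
    # logit(p) = log(p / (1 - p)), defined for 0 < p < 1
    return math.log(p / (1 - p))

def inv_logit(z):
    # Inverse of logit: p = 1 / (1 + exp(-z)), with z = a*x + b in the regression
    return 1 / (1 + math.exp(-z))

a, b = 0.8, -2.0  # hypothetical regression coefficients
for x in (0, 2.5, 5):
    p = inv_logit(a * x + b)
    print(f"x = {x}: p = {p:.3f}, logit(p) = {logit(p):.3f}")
```

At x = 2.5 the linear predictor is zero, so p = 0.5. A positive a makes p rise with x along an S-shaped curve.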
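Ordinal logistic regression
Logit function: $\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right)$
Regression equation: $\operatorname{logit}(p) = ax + b \quad\Longleftrightarrow\quad p = \frac{1}{1 + \exp(-(ax + b))}$
VGR model: $\operatorname{logit}(P(y \ge n)) = a_1\,\mathrm{Im}_1 + a_2\,\mathrm{Im}_2 + b_1\,\mathrm{PP}_1 + b_2\,\mathrm{PP}_2 + b_3\,\mathrm{PP}_3 + \mathbf{D}\,\mathbf{P} + \mathbf{E}\,\mathbf{O} - C_n$
Here $\mathrm{Im}_i$ and $\mathrm{PP}_j$ are indicator variables for imaging and postprocessing, $\mathbf{P}$ and $\mathbf{O}$ are indicator vectors for patient and observer with coefficient vectors $\mathbf{D}$ and $\mathbf{E}$, and $C_n$ is the threshold for score level $n$.
(Smedby & Fredrikson, British Journal of Radiology 2010)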
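A cumulative-logit model turns one linear predictor into probabilities for every score level. This sketch assumes the form logit P(y ≥ n) = η − C_n, where η stands for the sum of the a, b, D and E terms of the VGR model. The cut-points and the effect size 0.7 are hypothetical values, not estimates from the studies.

```python
import math

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

def category_probabilities(eta, cutpoints):
    """Cumulative-logit model, assuming logit P(y >= n) = eta - C_n.

    cutpoints: increasing C_2, ..., C_K for a scale 1..K.
    Returns [P(y = 1), ..., P(y = K)].
    """
    # P(y >= 1) = 1, P(y >= n) for n = 2..K, and P(y >= K + 1) = 0
    at_least = [1.0] + [inv_logit(eta - c) for c in cutpoints] + [0.0]
    return [at_least[k] - at_least[k + 1] for k in range(len(at_least) - 1)]

cutpoints = [-1.5, -0.5, 0.5, 1.5]                  # hypothetical C_2..C_5, scale 1-5
baseline = category_probabilities(0.0, cutpoints)   # e.g. reference image type
improved = category_probabilities(0.7, cutpoints)   # hypothetical effect of +0.7 on eta
print([round(p, 3) for p in baseline])
print([round(p, 3) for p in improved])
```

A single coefficient shifts the whole score distribution towards higher levels. This is why one number per factor is enough to compare imaging or postprocessing alternatives in this kind of model.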
Statistical model: the score (ordinal) depends on patient, imaging system (Im1, Im2, Im3) or its settings, postprocessing (PP1, PP2) and observer, analysed with ordinal logistic regression (VGR).
Statistical model: fixed and random effects
- Patient: random effect
- Imaging system (Im1, Im2, Im3) or settings: fixed effect
- Postprocessing (PP1, PP2): fixed effect
- Observer: random effect
The score (ordinal) is analysed with ordinal logistic regression (VGR).
Empirical data (De Geer 2011)
- Coronary CTA, 24 patients (P1–P24)
- Standard dose (310 mAs Ref) and reduced dose (62 mAs Ref)
- Reduced-dose images post-processed with a 2D adaptive filter (Sharpview)
- Filtered and unfiltered reduced-dose images viewed by 9 radiologists (R1–R9)
Criteria
- Criterion 1: Visually sharp reproduction of the thoracic aorta.
- Criterion 2: Visually sharp reproduction of the wall of the thoracic aorta.
- Criterion 3: Visually sharp reproduction of the heart.
- Criterion 4: Visually sharp reproduction of the left main coronary artery (LMA).
- Criterion 5: The image noise in relevant regions is sufficiently low for diagnosis.
Rating scale
1. Criterion is fulfilled
2. Criterion is probably fulfilled
3. Indecisive
4. Criterion is probably not fulfilled
5. Criterion is not fulfilled
Statistical model: the score (ordinal) depends on postprocessing (unfiltered or filtered), patient and observer; analysed with ordinal logistic regression.
Results: filter effect (ordinal logistic regression)
| Criterion | Regression coefficient | p value |
| 1: Visually sharp reproduction of the thoracic aorta | 0.53 | 0.0036 |
| 2: Visually sharp reproduction of the aortic wall | 0.90 | <0.000001 |
| 3: Visually sharp reproduction of the heart | 0.81 | 0.00005 |
| 4: Visually sharp reproduction of the LMA | 0.78 | 0.000004 |
| 5: Noise sufficiently low for diagnosis | 0.96 | <0.000001 |
Including the mAs effect
Both standard-dose and reduced-dose images were viewed, and the reduced-dose images were viewed both with and without filtering.
Statistical model with mAs: the score (ordinal) depends on patient, mAs setting (62 or 310; entered as log mAs, interval), postprocessing (unfiltered or filtered) and observer; analysed with ordinal logistic regression.
Statistical model with mAs etc.: as above, with patient weight (interval) and observer education (ordinal) as additional covariates.
Dose reduction (figure, criterion 3): probability of a score of 1 or 2 (y-axis, 0.0–1.0) versus mAs Ref setting (x-axis), for unfiltered and filtered images.
Results with mAs
| Criterion | Coefficient, log(mAs) | Coefficient, adaptive filter | Estimated mAs reduction |
| 1: Visually sharp reproduction of the thoracic aorta | 2.52 | 0.45 | 16% |
| 2: Visually sharp reproduction of the aortic wall | 2.53 | 0.75 | 26% |
| 3: Visually sharp reproduction of the heart | 2.54 | 0.74 | 25% |
| 4: Visually sharp reproduction of the LMA | 2.52 | 0.61 | 21% |
| 5: Noise sufficiently low for diagnosis | 2.74 | 0.77 | 24% |
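The estimated reductions can be reproduced from the two coefficients. The filter adds b_filter to the linear predictor. Lowering mAs by a factor r changes it by b_log(mAs) × log(r). Setting the two to cancel gives r = exp(−b_filter / b_log(mAs)). This sketch assumes the natural logarithm and that both coefficients have the same sign, with both filtering and higher mAs improving the score. Under these assumptions it matches the table's percentages, but the function name and layout are illustrative.

```python
import math

# (coefficient for log(mAs), coefficient for adaptive filter), copied from the table above
coefficients = {
    1: (2.52, 0.45),
    2: (2.53, 0.75),
    3: (2.54, 0.74),
    4: (2.52, 0.61),
    5: (2.74, 0.77),
}

def mas_reduction(b_log_mas, b_filter):
    """Fraction by which mAs can be lowered with the filter at equal predicted quality.

    Equal linear predictor: b_log_mas * log(m_filtered) + b_filter = b_log_mas * log(m_unfiltered),
    so m_filtered / m_unfiltered = exp(-b_filter / b_log_mas).
    """
    return 1 - math.exp(-b_filter / b_log_mas)

for criterion, (b_log_mas, b_filter) in coefficients.items():
    print(f"criterion {criterion}: {mas_reduction(b_log_mas, b_filter):.0%}")
```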
Study II
- Abdominal CT (Philips Mx8000IDT)
- Standard dose (180 mAs; CTDIvol = 12 mGy) vs. reduced dose (90 mAs; CTDIvol = 6 mGy) vs. reduced dose with 2D filtering vs. reduced dose with 3D filtering
- 12 patients, 6 observers
- Image-pair viewing
- 5 image quality criteria, judged on a 5-level scale (−2 to +2)
(Example images: normal dose, low dose, low dose 2D filtered, low dose 3D filtered.)
Visual grading scores (figure), criterion 1: delineation of pancreas. Frequency (0–100%) of scores from −2 to +2, i.e. favourable (+) vs. unfavourable (−) scores, for each image type (normal dose, low dose, 2D filter, 3D filter), in the early and late phases.
Potential for dose reduction
| Image quality criterion | Equivalent mAs, 2D filter | Equivalent mAs, 3D filter | Dose reduction, 2D filter | Dose reduction, 3D filter |
| 1: Delineation of pancreas | 112 | 103 | 38% | 43% |
| 2: Delineation of veins in liver | 120 | 102 | 33% | 43% |
| 3: Delineation of common bile duct | 114 | 102 | 37% | 43% |
| 4: Image noise | 106 | 88 | 41% | 51% |
| 5: Overall diagnostic acceptability | 117 | 102 | 35% | 43% |
Predicted mAs settings that, after filtering, would yield image quality equivalent to normal dose (180 mAs).
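The dose-reduction columns follow from the equivalent mAs values as 1 − equivalent mAs / 180. This assumes dose scales linearly with mAs at otherwise fixed settings, which the study's 180 mAs / 12 mGy and 90 mAs / 6 mGy pairs are consistent with. The sketch also converts each equivalent mAs to an approximate CTDIvol under the same assumption. Variable and function names are illustrative.

```python
NORMAL_MAS = 180          # normal-dose setting in Study II
NORMAL_CTDI_MGY = 12.0    # CTDIvol at 180 mAs in Study II

# (equivalent mAs with 2D filter, with 3D filter), copied from the table above
equivalent_mas = {
    "1: delineation of pancreas": (112, 103),
    "2: delineation of veins in liver": (120, 102),
    "3: delineation of common bile duct": (114, 102),
    "4: image noise": (106, 88),
    "5: overall diagnostic acceptability": (117, 102),
}

def dose_reduction(eq_mas, normal_mas=NORMAL_MAS):
    # Assumes dose is proportional to mAs when the other scan settings are unchanged
    return 1 - eq_mas / normal_mas

for criterion, (mas_2d, mas_3d) in equivalent_mas.items():
    ctdi_2d = NORMAL_CTDI_MGY * mas_2d / NORMAL_MAS
    ctdi_3d = NORMAL_CTDI_MGY * mas_3d / NORMAL_MAS
    print(f"{criterion}: 2D {dose_reduction(mas_2d):.0%} (~{ctdi_2d:.1f} mGy), "
          f"3D {dose_reduction(mas_3d):.0%} (~{ctdi_3d:.1f} mGy)")
```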
Conclusion
- For analysing diagnostic accuracy, ROC studies are superior, but costly and cumbersome.
- Visual grading experiments describe visual image quality.
- Simple comparisons can be made with VGC.
- Ordinal logistic regression (VGR) makes it possible to obtain direct numerical estimates of the potential for dose reduction. It is particularly useful when testing and optimising acquisition and post-processing protocols.