Definitions

  • Control group: The subjects who do not receive the treatment.
  • Treatment group: The subjects who receive the treatment.
  • Observational studies: The assignment of subjects to treatment and control groups is outside the control of the investigators.
    • E.g. To study the effect of smoking, investigators cannot choose which subjects go into the treatment group, because they can’t force people to smoke.
    • An observational study is one in which the investigator cannot use randomisation to allocate subjects to groups.
  • Confounder: A variable that is related to both the independent variable (the one you are studying) and the dependent variable (the outcome you are trying to measure), and can therefore create a false association between them.
    • E.g. You are studying whether people who exercise more are less likely to develop diseases. Age is related to both exercise and disease, so if you do not take it into account, you may falsely conclude that exercise is protective against disease.
    • Confounding can be caused by bias.
  • Bias: Something that affects the ability of the data to accurately measure the treatment effect.
    • Selection bias: introduced by the investigator when some subjects are more likely to be chosen than others.
    • Survivor bias: caused by the dropout of some subjects. E.g. an “improvement” that is really due to the dropout of the worst subjects, who did not respond to the treatment.
    • Adherer bias: occurs when certain types of subjects keep taking the treatment (or placebo) while others (the non-adherers) stop. Subjects who adhere to the treatment/program tend to be more compliant and healthier.
    • Non-response bias: caused by participants who fail to complete surveys.
    • Interviewer’s bias: when the interviewer has to make a choice of participants in the survey, or when characteristics of the interviewer have an effect on the answer given by participants.
    • Measurement bias: when the form of the question in the survey affects the response to the question.
  • Placebo: A pretend treatment designed to be neutral and indistinguishable from the real treatment. It has no actual effect.
    • E.g. A company is testing a new pain medication. In a randomized controlled trial, one group of participants would receive the new medication, while another group would receive a placebo pill that looks identical to the new medication but contains no active ingredients. The researchers would then compare the pain levels between the two groups to determine if the new medication is effective in reducing pain. The use of placebos is important in clinical research because it helps to minimize the influence of bias and other confounding factors on the results of a study.
  • Placebo effect: An effect that occurs because subjects think they have had the treatment.
  • Simpson’s paradox: A clear trend within individual groups disappears or reverses when the groups are pooled together.
    • E.g. A university has two departments, A and B. Overall, the university has a higher admission rate for men than for women. However, each department on its own has a higher admission rate for women than for men. This seems paradoxical because the overall trend contradicts the trend within each department.
    • This paradox occurs when there is a confounding variable. In the above example, the confounding variable is the department (major) applied to: men and women may apply to the two departments in different proportions, and the departments may differ in how hard they are to get into. Splitting the data by department holds the confounder fixed within each subgroup, which is why the pooled trend disappears or reverses.
  • Initial Data Analysis (IDA): A general first look at the data, without formally answering the research question.
    • Helps to see whether the data can answer research questions
    • Ensures that the later statistical analysis can be performed efficiently and minimises the risk of incorrect or misleading results.
    • May pose other research questions
    • Can identify main components/qualities of data
    • Suggests the population from which a sample is derived
    • Involves:
      • Data background (checking quality and integrity)
      • Data structure (what information has been collected)
      • Data wrangling (scraping, cleaning, tidying, combining)
      • Data summaries (graphical and numerical)
  • Big data: The massive amounts of data being collected in fields such as genomics, astrophysics etc.
    • Commonly high dimensional, which means that there are more variables (P) than subjects (N).
    • High volume, high velocity, high variety, high variability, low veracity/validity, high vulnerability, high volatility and high value.
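
The Simpson’s paradox entry above is easier to see with numbers. The sketch below uses admission counts that are invented purely for illustration (they are not real data), chosen so that women have the higher admission rate in each department while men have the higher rate once the departments are pooled. The names `counts`, `department_rates` and `pooled_rates` are hypothetical helpers, not part of any library.

```python
# Hypothetical admissions counts, invented for illustration only.
# (department, sex) -> (applicants, admitted)
counts = {
    ("A", "men"): (80, 48),
    ("A", "women"): (20, 14),
    ("B", "men"): (20, 4),
    ("B", "women"): (80, 24),
}


def department_rates(counts):
    """Admission rate for each (department, sex) pair."""
    return {
        key: admitted / applicants
        for key, (applicants, admitted) in counts.items()
    }


def pooled_rates(counts):
    """Admission rate for each sex after pooling all departments."""
    totals = {}
    for (_, sex), (applicants, admitted) in counts.items():
        total_applicants, total_admitted = totals.get(sex, (0, 0))
        totals[sex] = (total_applicants + applicants, total_admitted + admitted)
    return {
        sex: total_admitted / total_applicants
        for sex, (total_applicants, total_admitted) in totals.items()
    }


by_dept = department_rates(counts)
pooled = pooled_rates(counts)

# Within each department, compare the two rates.
for (dept, sex), r in sorted(by_dept.items()):
    print(f"Department {dept}, {sex}: {r:.0%}")

# After pooling, the direction of the comparison flips.
for sex, r in sorted(pooled.items()):
    print(f"Pooled, {sex}: {r:.0%}")
```

In these made-up numbers most men apply to department A, which admits a large share of its applicants, while most women apply to department B, which admits few. The department is the confounder: it is related both to sex and to the chance of admission, so pooling mixes the two effects together.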
