From: Laurent
Newsgroups: comp.lang.ada
Subject: Re: Some advice required [OT]
Date: Tue, 28 Dec 2021 04:54:51 -0800 (PST)
In-Reply-To: <875d209a-9504-4cdb-86cd-ce9b220a4a92n@googlegroups.com>
References: <7bede061-4b0f-4029-beb1-1056637e57d6n@googlegroups.com>
 <49538254-21ed-4fd0-8316-1bccc7d3c635n@googlegroups.com>
 <87sfue8a0v.fsf@bsb.me.uk>
 <31332c61-a370-43a5-bbe0-efe338ee6d8fn@googlegroups.com>
 <87fsqd7oz8.fsf@bsb.me.uk>
 <87y2456071.fsf@bsb.me.uk>
 <7f50b560-9d28-4572-a90c-7488fb27582en@googlegroups.com>
 <875d209a-9504-4cdb-86cd-ce9b220a4a92n@googlegroups.com>

On Tuesday, 28 December 2021 at 10:05:50 UTC+1, Laurent wrote:
> On Tuesday, 28 December 2021 at 08:48:32 UTC+1, Laurent wrote:
> > On Tuesday, 28 December 2021 at 01:29:57 UTC+1, Ben Bacarisse wrote:
> > > Laurent writes:
> > >
> > > > On Monday, 27 December 2021 at 21:49:18 UTC+1, Ben Bacarisse wrote:
> > > >> Laurent writes:
> > > >>
> > > >> > On Monday, 27 December 2021 at 14:14:42 UTC+1, Ben Bacarisse wrote:
> > > >> >> Laurent writes:
> > > >> >>
> > > >> >> > On Monday, 27 December 2021 at 12:16:27 UTC+1, Niklas Holsti wrote:
> > > >> >> >
> > > >> >> >> Sorry, but I found your problem description impossible to understand.
> > > >> >> >> Try to describe more clearly the experiment that is done, the structure
> > > >> >> >> of the data the experiment provides (the meaning of the Excel rows and
> > > >> >> >> columns), and the statistic you want to compute.
> > > >> >> >
> > > >> >> > Sorry tried to keep it short, was too short.
> > > >> >> >
> > > >> >> > Columns are the antimicrobial drugs
> > > >> >> > Rows are the microorganism.
> > > >> >> >
> > > >> >> > So every cell contains a result of S, I, R or simply an empty cell
> > > >> >> >
> > > >> >> > S = Sensible
> > > >> >> > I = Intermediate
> > > >> >> > R = Resistant
> > > >> >> >
> > > >> >> > empty cell
> > > >> >> >
> > > >> >> > If a patient has 3 strains of the same microorganism but with
> > > >> >> > different resistance profiles I have to find the most resistant
> > > >> >> > one. Or if they are different I keep them all.
> > > >> >> >
> > > >> >> > I have no idea how to explain what I am doing to the compiler.
> > > >> >> I think when you can explain it to people, you'll be able to code it. I
> > > >> >> am still struggling to understand what you need.
> > > >> >> > Why I would choose result from strain B over the result from strain A.
> > > >> >> >
> > > >> >> > strain A: SSSRSS
> > > >> >> > strain B: SSRRRS
> > > >> >> Let's space it out
> > > >> >>
> > > >> >>            drug 1  drug 2  drug 3  drug 4  drug 5  drug 6
> > > >> >> strain A      S       S       S       R       S       S
> > > >> >> strain B      S       S       R       R       R       S
> > > >> >>
> > > >> >> You want to choose B because it has is resistant to more drugs, yes?
> > > >> >>
> > > >> >
> > > >> > Yes indeed
> > > >> >
> > > >> >> I think, from the ordering you give, you need a measure that treats an R
> > > >> >> as "more important" that any "I" which is "more important" than an "S".
> > > >> >> (We will come to empty cells later.)
> > > >> >>
> > > >> >> I think you need to treat the number of Rs, Is and Ss like digits in a
> > > >> >> number. In base 10, the strains score
> > > >> >>
> > > >> >>           R S I
> > > >> >> strain A  1 5 0 = 150
> > > >> >> strain B  3 3 0 = 330
> > > >> >>
> > > >> >> Now, in fact, you don't need to use base 10. The smallest base you can
> > > >> >> use is one more than the maximum number of test results. If there can
> > > >> >> be up to 16 tests (say) the score is
> > > >> >>
> > > >> >> n(R)*17*17 + n(S)*17 + n(I).
> > > >> >>
> > > >> >> If this suits your needs, we can consider empty cells later on. It's
> > > >> >> not at all clear to me how to compare
> > > >> >>
> > > >> >> strain C R____
> > > >> >> strain D RRSSSS
> > > >> >>
> > > >> >> Strain C is "less resistant" but only because there is not enough
> > > >> >> information. In fact it seems more serious as it is resistant to all
> > > >> >> tested drugs.
> > > >> >>
> > > >> >
> > > >> > Strain C is probably garbage and I would remove it. With a bit of luck I will have the result with the same sample Id which would be complete.
> > > >> >
> > > >> >> And then what about
> > > >> >>
> > > >> >> strain D SR
> > > >> >> strain E RS
> > > >> >>
> > > >> >
> > > >> > Yes those are the cases which are annoying me.
> > > >> >
> > > >> > That's why I came up withe idea of multiplying the value of the result
> > > >> > (S=1, I=2 and R=3) with the position of the value. Tried it with
> > > >> > triplets but there will still be cases where different results will
> > > >> > give the same numeric value. Ignoring empty cell able tps for the moment.
> > > >> >
> > > >> > Strain F: SSR (1*1+2*1+3*3) =12 and Strain G: RRS (1*3+ 2*3+3*1) = 12
> > > >> > will be the same numerical value but they are different resistance
> > > >> > profiles I would in this case keep both.
> > > >> >
> > > >> > How to prevent that from happening.
> > > >> Can you first say why the suggestion I made is not helpful?
> > > >>
> > > >> --
> > > >> Ben.
> > > >
> > > > You mean that one:
> > > >
> > > >> >> I think you need to treat the number of Rs, Is and Ss like digits in a
> > > >> >> number. In base 10, the strains score
> > > >> >>
> > > >> >>           R S I
> > > >> >> strain A  1 5 0 = 150
> > > >> >> strain B  3 3 0 = 330
> > > >> >>
> > > >
> > > > Different resistance profiles same result:
> > > I don't yet understand the requirements so I am taking it in stages.
> > > The first requirement seemed to be "more or less resistant". To do that
> > > you can use digits in a large enough base but this will make the number
> > > of Rs, Ss and Is paramount. Is that acceptable as a first step?
> > >
> > The requirements are one strain of a certain microorganism/patient
> > The most resistant one or if they have different profiles
> >
> > SRS vs RRS => last one, more Rs
> >
> > SRS vs RSR = both, different profiles
> > > In order to help people to be able to make further suggestions, maybe
> > > you could give the relative ordering you would like to see between the
> > > following sets of profiles. For example, between SSR, SRS and RSS, I
> > > think the order you want is RSS > SRS > SSR.
> > >
> > > 1: SSR, SRS, RSS
> > >
> > > 2: RSI, RIS, SRI, SIR, IRS, ISR
> > >
> > > 3: SSSR, SSRS, SRSS, RSSS
> > >
> > > 4: RRSSS, RSSSR, RIIII, SRIII, RSIII, IIIRS, IIISR
> > >
> > The order of the results is given by the ID of the drug in the extraction tool.
> > I could probably order them by family and hierarchy of potence but
> > would that make a difference?
> > > It's possible you could make do with an extra field (or digits) that
> > > gives some measure of the relative ordering between otherwise similar
> > > sequences. For example, using base 10 (for convenience of arithmetic)
> > > both RRSSI and RSRSI would score 212xx but the last xx would reflect the
> > > positioning of the results in the sequence. There are lots of way to do
> > > this. One way would be use, as you were thinking, some sort of weighted
> > > count. Using S=0, I=1 and R=2 with weights
> > >
> > > 54321
> > > RRSSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+4) + 0*(3+2) + 1*1 = 21219
> > > RSRSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+3) + 0*(4+2) + 1*1 = 21217
> > >
> > So to be sure that I am following:
> >
> > 2*(5+4) = value of R (=2) * position of R(@5 and @4)
> > 2*(5+3) = value of R (=2) * position of R(@5 and @3)
> >
> > 0*(3+2) = value of S (=0) * position of S(@3 and @2)
> > 0*(4+2) = value of S (=0) * position of S(@4 and @2)
> >
> > 1*1 = value of I (=1) * position of I (@1)
> >
> > 2*10000 + 1*1000 + 2*100 Is just used as padding? So 212 could be any other
> > number?
> >
> Eh forget the last sentence, brain fart: I have 2 R's so 2*10000, 1 I so 1*1000 and 2 S's so 2*100
> > But in this example I would have to keep both as drug 5,2 and 1 are common
> > to both results but 4 and 3 are unique.
> >
> > The score would be completely misleading.
> >
> > So if my table has a width of 20 columns the first column would be
> > 10^20, the next 10^19,.... +/- a few 0s off?
> >
> > I would have to implement it and see what I get as result.
> > > If you absolutely must never get duplicate numbers, but you still want
> > > to preserve a strict specified ordering, I think you will have much more
> > > work to do.
> > >
> > > Getting a unique number for each case it trivial (but the ordering will
> > > be wrong) and getting an ordering that rates every R > every S > every I
> > > is also trivial, but there will be lots of duplicates.
> > > It's finding the balance that's going to be hard.
> > >
> > > --
> > > Ben.
> > I have prepared a cleaned up Excel workbook with only the duplicates which
> > pose problems. The ones I would keep have an orange ID.
> > I could upload it to Github. If that helps understanding the different cases.
> >
> > Thanks for your patience
> >
> > Laurent

Ben,

I have implemented your solution, but I don't understand why S would have a value of 0. I then don't need to take care of the S's, because their contribution will always be 0. Not that it changes a lot, because I still couldn't choose the profile of interest based only on the numbers.

Profile      Ben's solution   Mine
R R S S I    212 11           212 1205
R S R S I    212 13           212 1405
R R R S I    311 17           311 1805
R S R R I    311 21           311 1407
S R R R I    311 23           311 1607

311 17 and 311 23 are the most likely, but it is unclear where the difference might be.

I have adapted my current solution to include the number of R, I, S:

weight of the results: S=1, I=2, R=3
weight of the position in the triplet: 1st=1, 2nd=2, 3rd=3

i.e.:

R R R => First triplet:  1*3+2*3+3*3 = 18
S I   => Second triplet: 1*1+2*2     = 05

RIS count: 311
Append 1st triplet: 311 18
Append 2nd triplet: 311 18 05

311 18 05 and 311 16 07 are the most likely, with some clues as to which triplet is different.

Am I not somehow introducing a bias by multiplying the value with the position in the triplet?
And then there is still the case where SSR (1*1+2*1+3*3=12) and RRS (1*3+2*3+3*1=12) will both resolve to the same value.

With 5 values it looks easy, but with 20 I am getting headaches. I don't even know if the triplet idea is good. I got inspired by some old microorganism identification cards, which put 3 test results into one digit to get a more compact identification profile.

Wouldn't I need some sort of Traveling Salesman Problem algorithm to find the profile with the highest number of resistances and the highest number of triplets with high values?

Thanks

Laurent
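
For anyone who wants to check the figures in the table, the two scores can be reproduced with a short sketch. It is in Python rather than Ada purely for brevity, and the function names (ris_count, position_score, triplet_score, format_triplets) are illustrative, not taken from any posted code. Two assumptions are baked in: position weights run 1, 2, 3, ... from the left (this is what yields 11, 13, 17, 21, 23; the quoted example with weights 54321 gives 19 and 17 for the first two profiles instead), and the count is kept as an (R, I, S) tuple so that it still works when a count reaches 10 or more.

```python
from collections import Counter

def ris_count(profile):
    """Return (number of R, number of I, number of S) for a profile string."""
    counts = Counter(profile)
    return (counts["R"], counts["I"], counts["S"])

def position_score(profile):
    """Weighted count with S=0, I=1, R=2.

    Assumption: position weights are 1, 2, 3, ... counted from the left.
    """
    value = {"S": 0, "I": 1, "R": 2}
    return sum(value[r] * pos for pos, r in enumerate(profile, start=1))

def triplet_score(profile):
    """Score each group of three results with S=1, I=2, R=3 times position 1..3."""
    value = {"S": 1, "I": 2, "R": 3}
    groups = [profile[i:i + 3] for i in range(0, len(profile), 3)]
    return [sum(value[r] * pos for pos, r in enumerate(g, start=1))
            for g in groups]

def format_triplets(scores):
    """Join triplet scores as two-digit fields, e.g. [18, 5] -> '1805'."""
    return "".join(f"{s:02d}" for s in scores)

if __name__ == "__main__":
    for p in ["RRSSI", "RSRSI", "RRRSI", "RSRRI", "SRRRI"]:
        count = "".join(str(n) for n in ris_count(p))
        print(" ".join(p), count, position_score(p),
              count, format_triplets(triplet_score(p)))

    # The collision mentioned above: both triplets score 12.
    print(triplet_score("SSR"), triplet_score("RRS"))
```

Python compares tuples element by element, so comparing the (R, I, S) tuples directly orders profiles the same way as writing the counts as digits in a large enough base, without having to choose the base. The last line shows that the triplet score alone cannot tell SSR from RRS, so any rule for "keep both" would have to look at the profiles themselves and not only at the number.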