Re: Some advice required [OT]

comp.lang.ada

From: Ben Bacarisse <ben.usenet@bsb.me.uk>
Subject: Re: Some advice required [OT]
Date: Tue, 28 Dec 2021 13:43:21 +0000
Message-ID: <87sfuc975y.fsf@bsb.me.uk>
In-Reply-To: 7f50b560-9d28-4572-a90c-7488fb27582en@googlegroups.com

Laurent <lutgenl@icloud.com> writes:

> On Tuesday, 28 December 2021 at 01:29:57 UTC+1, Ben Bacarisse wrote:
>> Laurent <lut...@icloud.com> writes: 
>> 
>> > On Monday, 27 December 2021 at 21:49:18 UTC+1, Ben Bacarisse wrote: 
>> >> Laurent <lut...@icloud.com> writes: 
>> >> 
>> >> > On Monday, 27 December 2021 at 14:14:42 UTC+1, Ben Bacarisse wrote: 
>> >> >> Laurent <lut...@icloud.com> writes: 
>> >> >> 
>> >> >> > On Monday, 27 December 2021 at 12:16:27 UTC+1, Niklas Holsti wrote: 
>> >> >> > 
>> >> >> >> Sorry, but I found your problem description impossible to understand. 
>> >> >> >> Try to describe more clearly the experiment that is done, the structure 
>> >> >> >> of the data the experiment provides (the meaning of the Excel rows and 
>> >> >> >> columns), and the statistic you want to compute. 
>> >> >> > 
>> >> >> > Sorry tried to keep it short, was too short. 
>> >> >> > 
>> >> >> > Columns are the antimicrobial drugs 
>> >> >> > Rows are the microorganism. 
>> >> >> > 
>> >> >> > So every cell contains a result of S, I, R or simply an empty cell 
>> >> >> > 
>> >> >> > S = Susceptible 
>> >> >> > I = Intermediate 
>> >> >> > R = Resistant 
>> >> >> > 
>> >> >> > empty cell < S < I < R 
>> >> >> > 
>> >> >> > If a patient has 3 strains of the same microorganism but with 
>> >> >> > different resistance profiles I have to find the most resistant 
>> >> >> > one. Or if they are different I keep them all. 
>> >> >> > 
>> >> >> > I have no idea how to explain what I am doing to the compiler. 
>> >> >> I think when you can explain it to people, you'll be able to code it. I 
>> >> >> am still struggling to understand what you need. 
>> >> >> > Why I would choose result from strain B over the result from strain A. 
>> >> >> > 
>> >> >> > strain A: SSSRSS 
>> >> >> > strain B: SSRRRS 
>> >> >> Let's space it out 
>> >> >> 
>> >> >>           drug 1  drug 2  drug 3  drug 4  drug 5  drug 6 
>> >> >> strain A    S       S       S       R       S       S 
>> >> >> strain B    S       S       R       R       R       S 
>> >> >> 
>> >> >> You want to choose B because it is resistant to more drugs, yes? 
>> >> >> 
>> >> > 
>> >> > Yes indeed 
>> >> > 
>> >> >> I think, from the ordering you give, you need a measure that treats an R 
>> >> >> as "more important" than any "I" which is "more important" than an "S". 
>> >> >> (We will come to empty cells later.) 
>> >> >> 
>> >> >> I think you need to treat the number of Rs, Is and Ss like digits in a 
>> >> >> number. In base 10, the strains score 
>> >> >> 
>> >> >>            R  S  I 
>> >> >> strain A   1  5  0  = 150 
>> >> >> strain B   3  3  0  = 330 
>> >> >> 
>> >> >> Now, in fact, you don't need to use base 10. The smallest base you can 
>> >> >> use is one more than the maximum number of test results. If there can 
>> >> >> be up to 16 tests (say) the score is 
>> >> >> 
>> >> >> n(R)*17*17 + n(S)*17 + n(I). 
>> >> >> 
>> >> >> If this suits your needs, we can consider empty cells later on. It's 
>> >> >> not at all clear to me how to compare 
>> >> >> 
>> >> >> strain C R____ 
>> >> >> strain D RRSSSS 
>> >> >> 
>> >> >> Strain C is "less resistant" but only because there is not enough 
>> >> >> information. In fact it seems more serious as it is resistant to all 
>> >> >> tested drugs. 
>> >> >> 
>> >> > 
>> >> > Strain C is probably garbage and I would remove it. With a bit of luck I will have the result with the same sample Id which would be complete. 
>> >> > 
>> >> >> And then what about 
>> >> >> 
>> >> >> strain D SR 
>> >> >> strain E RS 
>> >> >> 
>> >> > 
>> >> > Yes those are the cases which are annoying me. 
>> >> > 
>> >> > That's why I came up with the idea of multiplying the value of the result 
>> >> > (S=1, I=2 and R=3) with the position of the value. Tried it with 
>> >> > triplets but there will still be cases where different results will 
>> >> > give the same numeric value. Ignoring empty cells for the moment. 
>> >> > 
>> >> > Strain F: SSR (1*1+2*1+3*3) =12 and Strain G: RRS (1*3+ 2*3+3*1) = 12 
>> >> > will be the same numerical value but they are different resistance 
>> >> > profiles I would in this case keep both. 
>> >> > 
>> >> > How to prevent that from happening. 
>> >> Can you first say why the suggestion I made is not helpful? 
>> >> 
>> >> -- 
>> >> Ben. 
>> > 
>> > You mean that one: 
>> > 
>> >> >> I think you need to treat the number of Rs, Is and Ss like digits in a 
>> >> >> number. In base 10, the strains score 
>> >> >> 
>> >> >>            R  S  I 
>> >> >> strain A   1  5  0  = 150 
>> >> >> strain B   3  3  0  = 330 
>> >> >> 
>> > 
>> > Different resistance profiles same result:
>>
>> I don't yet understand the requirements so I am taking it in stages. 
>> The first requirement seemed to be "more or less resistant". To do that 
>> you can use digits in a large enough base but this will make the number 
>> of Rs, Ss and Is paramount. Is that acceptable as a first step?  
>
> The requirements are one strain of a certain microorganism/patient
> The most resistant one or if they have different profiles
>
> SRS vs RRS => last one, more Rs
>
> SRS vs RSR = both, different profiles

I think this is a "yes" to my question.  The trouble is you speak in the
subject domain (as one would expect) but I have to speak in the computer
science domain, because that's all I know.

You speak of giving profiles a score.  To me, that means giving a
profile some numeric value (actually it need not be numeric, but let's
stick with numbers for the moment).  The score orders the profiles --
some score high (= very resistant) and some score lower.

>> In order to help people to be able to make further suggestions, maybe 
>> you could give the relative ordering you would like to see between the 
>> following sets of profiles. For example, between SSR, SRS and RSS, I 
>> think the order you want is RSS > SRS > SSR. 
>> 
>> 1: SSR, SRS, RSS 
>> 
>> 2: RSI, RIS, SRI, SIR, IRS, ISR 
>> 
>> 3: SSSR, SSRS, SRSS, RSSS 
>> 
>> 4: RRSSS, RSSSR, RIIII, SRIII, RSIII, IIIRS, IIISR 
>
> The order of the results is given by the ID of the drug in the extraction tool.
> I could probably order them by family and hierarchy of potency but 
> would that make a difference?

I am referring to the order you want the score to produce.  You want, I
think, the score for a profile with more Rs to be higher than any score
for a profile with fewer.  Using x for an S or an I or a missing result,
you want all of

  RRRxxx  xRRxRx  xxRxRR

and so on to score higher than any of

  RRxxxx  RxxRxx  xxRxRx

Three Rs beat two Rs no matter where they are.  Similarly, when the
number of Rs is the same, you want a profile with more Is to "beat"
(score higher than) any profile with fewer.

There is a standard way to do this which can result in a pure number,
but you can also think of it as a short sequence of numbers (three in
this case) where the first is more important than the second, which is
more important than the third.

So IIISR is given the sequence (1,3,1) (1 R, 3 Is and 1 S).  As a base
10 number, that's 131.  In base 100 it's 10301.  Bigger bases allow one
to separate larger counts.
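
As a rough illustration of the count-based score, here is a sketch in
Python rather than Ada.  The names count_key and count_score are made up
for this example, and the counts are taken in the order R, I, S on the
assumption that Is should outrank Ss when the R counts tie:

```python
from collections import Counter


def count_key(profile):
    """Return the counts (nR, nI, nS) for a profile string such as "RRSSI".

    Any other character (for example a placeholder for an empty cell)
    is simply not counted.
    """
    counts = Counter(profile)
    return (counts["R"], counts["I"], counts["S"])


def count_score(profile, base=10):
    """Pack (nR, nI, nS) into one number, treating each count as a digit."""
    n_r, n_i, n_s = count_key(profile)
    if max(n_r, n_i, n_s) >= base:
        raise ValueError("base is too small for these counts")
    return (n_r * base + n_i) * base + n_s
```

With this, count_score("RRSSI") gives 212 and count_score("RRSSI",
base=100) gives 20102.  Comparing the tuples from count_key directly
orders profiles the same way without having to pick a base at all.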

Now there is also a secondary ordering.  When the number of Rs, Is and
Ss is the same, I think you wanted to consider some test results as more
important.  To do that, I am suggesting adding another number to the
sequence or another digit or two if you like.

>> It's possible you could make do with an extra field (or digits) that 
>> gives some measure of the relative ordering between otherwise similar 
>> sequences. For example, using base 10 (for convenience of arithmetic) 
>> both RRSSI and RSRSI would score 212xx but the last xx would reflect the 
>> positioning of the results in the sequence. There are lots of ways to do 
>> this. One way would be to use, as you were thinking, some sort of weighted 
>> count. Using S=0, I=1 and R=2 with weights 
>> 
>> 54321 
>> RRSSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+4) + 0*(3+2) + 1*1 = 21219 
>> RSRSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+3) + 0*(4+2) + 1*1 = 21217 
>> 
>
> So to be sure that  I am following:
>
> 2*(5+4) = value of R (=2) * position of R(@5 and @4)
> 2*(5+3) = value of R (=2) * position of R(@5 and @3)
>
> 0*(3+2) = value of S (=0) * position of S(@3 and @2)
> 0*(4+2) = value of S (=0) * position of S(@4 and @2)
>
> 1*1 = value of I (=1) * position of I (@1)

Yes.

> 2*10000 + 1*1000 + 2*100 Is just used as padding? So 212 could be any other
> number?

No, the 212 reflects the counts of Rs, Is and Ss.  It's up in the high
digits so that these counts trump everything else in the score.
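
To make the arithmetic concrete, the two five-result examples quoted
above can be reproduced with a short Python sketch.  The function names
are invented for illustration, and the layout (counts in the ten
thousands, thousands and hundreds, weighted sum in the last two digits)
is only valid for short profiles like these:

```python
VALUE = {"S": 0, "I": 1, "R": 2}


def weighted_sum(profile):
    """Sum value * weight, where the leftmost result has the largest weight.

    For a profile of length n the weights are n, n-1, ..., 1.
    Characters other than S, I and R contribute nothing.
    """
    n = len(profile)
    return sum(VALUE.get(ch, 0) * (n - i) for i, ch in enumerate(profile))


def combined_score(profile):
    """Counts of R, I and S in the high digits, weighted sum in the low two."""
    n_r = profile.count("R")
    n_i = profile.count("I")
    n_s = profile.count("S")
    tiebreak = weighted_sum(profile)
    if max(n_r, n_i, n_s) > 9 or tiebreak > 99:
        raise ValueError("profile is too long for this base-10 layout")
    return n_r * 10000 + n_i * 1000 + n_s * 100 + tiebreak
```

Here combined_score("RRSSI") is 21219 and combined_score("RSRSI") is
21217, matching the hand calculation.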

> But in this example I would have to keep both as drug 5,2 and 1 are common
> to both results but 4 and 3 are unique.

Ah, more domain specific constraints are coming in.  I think I will have
to duck out of this thread soon.

> The score would be completely misleading.
>
> So if my table has a width of 20 columns the first column would be
> 10^20, the next 10^19,.... +/- a few 0s off?

No.  If you have twenty columns, you need to use a base of at least 21
because the R, I and S counts could be as high as 20.  For convenience,
use 100 and leave two base-100 digits spare "at the bottom" for the "all
counts being equal" differential score:

  nRs nIs nSs 00 00

10 decimal digits in all.  But see below for another option.
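
A sketch of that layout in Python, with a made-up function name, might
look like this.  It assumes each count is below 100 and that whatever
tie-break value is used fits in the two spare base-100 digits:

```python
def packed_score(n_r, n_i, n_s, tiebreak=0):
    """Lay out the score as  nRs nIs nSs 00 00  in base 100."""
    for count in (n_r, n_i, n_s):
        if not 0 <= count < 100:
            raise ValueError("each count must fit in one base-100 digit")
    if not 0 <= tiebreak < 100 * 100:
        raise ValueError("tiebreak must fit in two base-100 digits")
    counts = (n_r * 100 + n_i) * 100 + n_s
    return counts * 100 * 100 + tiebreak
```

Because the counts sit above the spare digits, a profile with 20 Rs
scores higher than one with 19 Rs whatever the tie-break value is.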

> I would have to implement it and see what I get as result.

To get better answers, say what it is about the results that you don't
like.  If you can't say what you do want, saying what you don't want is
the next best thing.

> I have prepared a cleaned up Excel workbook with only the duplicates which
> pose problems. The ones I would keep have an orange ID.
> I could upload it to Github. If that helps understanding the different
> cases.

Probably.  I may have the wrong end of the stick altogether because you
worry about duplicates but talk about scores suggesting better or
worse.  There is nothing logically wrong with a duplicate score.

You can, very simply, assign every set of results a unique number.  Just
replace R with 2, I with 1 and S with 0 and concatenate the results to
treat it as a number:

  IIISR = 11102

If you need the R, I and S counts to "dominate", prepend this number
with the counts (using, say, 2 digits each):

  IIISR = 1030111102  (1 R, 3 Is, 1 S and the number from the sequence)

Of course you can make the numbers smaller by using smaller bases, but I
can't say if this produces the kind of score that you'd find useful.
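
For what it's worth, both encodings are only a few lines of Python.
This is a sketch with invented names; it assumes every result is one of
S, I or R (no empty cells) and that all profiles being compared have the
same length, since leading Ss turn into leading zeros:

```python
DIGIT = {"S": "0", "I": "1", "R": "2"}


def sequence_number(profile):
    """Replace R with 2, I with 1 and S with 0 and read the result as a number."""
    return int("".join(DIGIT[ch] for ch in profile))


def dominated_number(profile):
    """Prefix the sequence digits with two-digit counts of R, I and S."""
    counts = "".join(f"{profile.count(ch):02d}" for ch in "RIS")
    digits = "".join(DIGIT[ch] for ch in profile)
    return int(counts + digits)
```

sequence_number("IIISR") gives 11102 and dominated_number("IIISR")
gives 1030111102, as above.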

-- 
Ben.
